DDBMS Handbook (MSE 1)
(2170714)
Hand Book
Year: 2020-21
CE & IT Department
A database is an ordered collection of related data that is built for a specific purpose. A
database may be organized as a collection of multiple tables, where a table represents a real
world element or entity. Each table has several different fields that represent the characteristic
features of the entity.
For example, a company database may include tables for projects, employees, departments,
products and financial records. The fields in the Employee table may be Name, Company_Id,
Date_of_Joining, and so forth.
Examples of popular DBMS software include −
MySQL
Oracle
SQL Server
dBASE
FoxPro
PostgreSQL, etc.
Database Schema
A database schema is a description of the database which is specified during database
design and subject to infrequent alterations.
It defines the organization of the data, the relationships among them, and the constraints
associated with them.
Databases are often represented through the three-schema or ANSI-SPARC
architecture. The goal of this architecture is to separate the user applications from the
physical database. The three levels are −
Internal Level having Internal Schema − It describes the physical structure, details of
internal storage and access paths for the database.
Conceptual Level having Conceptual Schema − It describes the structure of the whole
database while hiding the details of physical storage of data. This illustrates the entities,
attributes with their data types and constraints, user operations and relationships.
External or View Level having External Schemas or Views − It describes the portion of a
database relevant to a particular user or a group of users while hiding the rest of database.
Types of DBMS
There are four types of DBMS.
Hierarchical DBMS
In hierarchical DBMS, the relationships among data in the database are established so that one
data element exists as a subordinate of another. The data elements have parent-child relationships
and are modelled using the “tree” data structure. These are very fast and simple.
Network DBMS
In network DBMS, the relationships among data in the database are of type many-to-many and
are modelled using a graph structure.
Relational DBMS
In relational databases, the database is represented in the form of relations. Each relation models
an entity and is represented as a table of values. In the relation or table, a row is called a tuple
and denotes a single record. A column is called a field or an attribute and denotes a characteristic
property of the entity. RDBMS is the most popular database management system.
Distributed DBMS
A distributed database is a set of interconnected databases that is distributed over the computer
network or internet. A Distributed Database Management System (DDBMS) manages the
distributed database and provides mechanisms so as to make the databases transparent to the
users. In these systems, data is intentionally distributed among multiple nodes so that all
computing resources of the organization can be optimally used.
Operations on a DBMS
The four basic operations on a database are Create, Retrieve, Update and Delete.
CREATE database structure and populate it with data − Creation of a database relation
involves specifying the data structures, data types and the constraints of the data to be
stored.
CREATE TABLE STUDENT (
NAME VARCHAR2(25),
YEAR INTEGER,
STREAM VARCHAR2(25)
);
Once the data format is defined, the actual data is stored in accordance with the format in
some storage medium.
Example SQL command to insert a single tuple into the student table −
INSERT INTO STUDENT (NAME, YEAR, STREAM)
VALUES ('ANKIT', 1, 'COMPUTER SCIENCE');
RETRIEVE information from the database − Retrieving information generally involves
selecting a subset of a table or displaying the table after some computations have been
applied.
Example − To retrieve the names of all students of the Computer Science stream, the
following SQL query needs to be executed −
SELECT NAME FROM STUDENT
WHERE STREAM = 'COMPUTER SCIENCE';
UPDATE information stored and modify database structure – Updating a table involves
changing old values in the existing table’s rows with new values.
UPDATE STUDENT
SET YEAR = YEAR + 1;
Modifying database means to change the structure of the table. However, modification of
the table is subject to a number of restrictions.
Example − To add a new field or column, say address to the Student table, we use the
following SQL command −
ALTER TABLE STUDENT ADD ADDRESS VARCHAR2(50);
DELETE information from the database or delete a table − Deletion removes rows from
a table or removes the table itself.
Example − To delete all students who are in 4th year currently when they are passing
out, we use the SQL command −
DELETE FROM STUDENT
WHERE YEAR = 4;
Example − To remove the student table completely, the SQL command used is −
DROP TABLE STUDENT;
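The four operations above can be traced end to end in code. The following is a minimal sketch using Python's built-in sqlite3 module (the STUDENT columns and sample rows are illustrative; sqlite3 uses TEXT/INTEGER in place of Oracle-style VARCHAR2):

```python
import sqlite3

# In-memory database; the STUDENT columns follow the examples above
# (sqlite3 uses TEXT/INTEGER in place of Oracle-style VARCHAR2)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define the structure, then populate it with data
cur.execute("CREATE TABLE STUDENT (NAME TEXT, STREAM TEXT, YEAR INTEGER)")
cur.execute("INSERT INTO STUDENT VALUES ('Ankit', 'Computer Science', 2)")
cur.execute("INSERT INTO STUDENT VALUES ('Vijay', 'Electronics', 4)")

# RETRIEVE: names of all students of the Computer Science stream
cur.execute("SELECT NAME FROM STUDENT WHERE STREAM = 'Computer Science'")
cs_names = cur.fetchall()          # [('Ankit',)]

# UPDATE: promote every student by one year
cur.execute("UPDATE STUDENT SET YEAR = YEAR + 1")

# DELETE: remove students who have passed out (now beyond the 4th year)
cur.execute("DELETE FROM STUDENT WHERE YEAR > 4")
cur.execute("SELECT NAME, YEAR FROM STUDENT")
remaining = cur.fetchall()         # [('Ankit', 3)]
conn.close()
```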
This chapter introduces the concept of DDBMS. In a distributed database, there are a number of
databases that may be geographically distributed all over the world. A distributed DBMS
manages the distributed database in a manner so that it appears as one single database to users.
In the later part of the chapter, we go on to study the factors that lead to distributed databases,
and their advantages and disadvantages.
Distributed DBMS
A distributed database is a collection of multiple interconnected databases, which are spread physically
across various locations that communicate via a computer network.
1) Databases in the collection are logically interrelated with each other. Often they represent a
single logical database.
2) Data is physically stored across multiple sites. Data in each site can be managed by a DBMS
independent of the other sites.
3) The processors in the sites are connected via a network. They do not have any multiprocessor
configuration.
A distributed database management system (DDBMS) is a centralized software system that manages a
distributed database in a manner as if it were all stored in a single location.
Features
1) It is used to create, retrieve, update and delete distributed databases.
2) It synchronizes the database periodically and provides access mechanisms by the virtue of
which the distribution becomes transparent to the users.
3) It is used in application areas where large volumes of data are processed and accessed by
numerous users simultaneously.
Factors Encouraging DDBMS
Distributed Nature of Organizational Units − Most organizations in the current times are
subdivided into multiple units that are physically distributed over the globe. Each unit requires
its own set of local data. Thus, the overall database of the organization becomes distributed.
Need for Sharing of Data − The multiple organizational units often need to communicate with
each other and share their data and resources. This demands common databases or replicated
databases that should be used in a synchronized manner.
Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and Online
Analytical Processing (OLAP) work upon diversified systems which may have common data.
Distributed database systems aid both of these kinds of processing by providing synchronized data.
Database Recovery − One of the common techniques used in DDBMS is replication of data
across different sites. Replication of data automatically helps in data recovery if the database at any
site is damaged.
Support for Multiple Application Software − Most organizations use a variety of application
software each with its specific database support. DDBMS provides a uniform functionality for
using the same data among different platforms.
Advantages of Distributed Databases
Modular Development − If the system needs to be expanded to new locations or new units, in
centralized database systems, the action requires substantial efforts and disruption in the existing
functioning. However, in distributed databases, the work simply requires adding new computers
and local data to the new site and finally connecting them to the distributed system, with no
interruption in current functions.
More Reliable − In case of database failures, the total system of centralized databases comes to a
halt. However, in distributed systems, when a component fails, the functioning of the system
continues, though possibly at reduced performance. Hence, DDBMS is more reliable.
Better Response − If data is distributed in an efficient manner, then user requests can be met
from local data itself, thus providing faster response. On the other hand, in centralized systems,
all queries have to pass through the central computer for processing, which increases the
response time.
Lower Communication Cost − In distributed database systems, if data is located locally where it
is mostly used, then the communication costs for data manipulation can be minimized. This is
not feasible in centralized systems.
Disadvantages of Distributed Databases
Need for complex and expensive software − DDBMS demands complex and often
expensive software to provide data transparency and co-ordination across the several sites.
Data integrity − The need for updating data in multiple sites pose problems of data
integrity.
In this part of the tutorial, we will study the different aspects that aid in designing distributed
database environments. This chapter starts with the types of distributed databases. Distributed
databases can be classified into homogeneous and heterogeneous databases having further
divisions. The next section of this chapter discusses the distributed architectures, namely
client-server, peer-to-peer and multi-DBMS. Finally, the different design alternatives like
replication and fragmentation are introduced.
Homogeneous Distributed Databases
In a homogeneous distributed database, all the sites use identical DBMS and operating systems.
Its properties are −
1) The sites use identical DBMS or DBMS from the same vendor.
2) Each site is aware of all other sites and cooperates with other sites to process user requests.
Autonomous − Each database is independent and functions on its own. The databases are
integrated by a controlling application and use message passing to share data updates.
Non-autonomous − Data is distributed across the homogeneous nodes and a central or master
DBMS co-ordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites may use different DBMS products and
data models. Its properties are −
1) The system may be composed of a variety of DBMSs like relational, network, hierarchical or
object oriented.
2) A site may not be aware of other sites, and so there is limited co-operation in processing user
requests.
Federated − The heterogeneous database systems are independent in nature and are integrated
together so that they function as a single database system.
Un-federated − The database systems employ a central coordinating module through which the
databases are accessed.
Architectural models of distributed DBMS are commonly classified along the following dimensions −
Distribution − It states the physical distribution of data across the different sites.
Autonomy − It indicates the distribution of control of the database system and the degree to
which each constituent DBMS can operate independently.
Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.
RDBMS stands for Relational Database Management System. RDBMS is the basis for SQL, and
for all modern database systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft
Access.
The data in an RDBMS is stored in database objects called tables. A table is
basically a collection of related data entries and it consists of numerous columns and rows.
Remember, a table is the most common and simplest form of data storage in a relational
database. The following is an example of a CUSTOMERS table −
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
| 6  | Komal    | 22  | MP        | 4500.00  |
+----+----------+-----+-----------+----------+
Every table is broken up into smaller entities called fields. The fields in the CUSTOMERS
table consist of ID, NAME, AGE, ADDRESS and SALARY.
A field is a column in a table that is designed to maintain specific information about every
record in the table.
For example, the ADDRESS column in the CUSTOMERS table represents the location
description of a customer.
NULL value
A NULL value in a table is a value in a field that appears to be blank, which means a field with a
NULL value is a field with no value.
It is very important to understand that a NULL value is different than a zero value or a field that
contains spaces. A field with a NULL value is the one that has been left blank during a record
creation.
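The distinction between NULL and other blank-looking values can be demonstrated with a small sketch using Python's sqlite3 module (the CUSTOMERS rows here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical CUSTOMERS rows: one NULL address, one empty-string address
cur.execute("CREATE TABLE CUSTOMERS (ID INTEGER, NAME TEXT, ADDRESS TEXT)")
cur.executemany("INSERT INTO CUSTOMERS VALUES (?, ?, ?)",
                [(1, "Ramesh", "Ahmedabad"),
                 (2, "Khilan", None),       # NULL: no value at all
                 (3, "Kaushik", "")])       # empty string: a real value

# NULL is matched only by IS NULL, never by = '' or = 0
cur.execute("SELECT NAME FROM CUSTOMERS WHERE ADDRESS IS NULL")
null_rows = cur.fetchall()         # [('Khilan',)]

# The empty string is a value, so that row is NOT NULL
cur.execute("SELECT NAME FROM CUSTOMERS WHERE ADDRESS = ''")
empty_rows = cur.fetchall()        # [('Kaushik',)]
conn.close()
```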
Constraints
Constraints are the rules enforced on data columns on a table. These are used to limit the type of
data that can go into a table. This ensures the accuracy and reliability of the data in the database.
Constraints can either be column level or table level. Column level constraints are applied only
to one column whereas, table level constraints are applied to the entire table.
Following are some of the most commonly used constraints available in SQL
NOT NULL Constraint − Ensures that a column cannot have a NULL value.
DEFAULT Constraint − Provides a default value for a column when none is specified.
UNIQUE Constraint − Ensures that all the values in a column are different.
CHECK Constraint − The CHECK constraint ensures that all values in a column satisfy
certain conditions.
INDEX − Used to create and retrieve data from the database very quickly.
Domain Integrity − Enforces valid entries for a given column by restricting the type, the format,
or the range of values.
Referential Integrity − Rows that are referenced by other records cannot be deleted.
User-Defined Integrity − Enforces some specific business rules that do not fall into entity,
domain or referential integrity.
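A minimal sketch of column-level constraints in action, using Python's sqlite3 module (the table and column names are assumed for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Column-level constraints on a hypothetical CUSTOMERS table
cur.execute("""
    CREATE TABLE CUSTOMERS (
        ID   INTEGER PRIMARY KEY,
        NAME TEXT    NOT NULL,
        AGE  INTEGER CHECK (AGE >= 18),
        CITY TEXT    DEFAULT 'Unknown'
    )
""")
cur.execute("INSERT INTO CUSTOMERS (ID, NAME, AGE) VALUES (1, 'Komal', 22)")

# DEFAULT fills in the missing CITY value
cur.execute("SELECT CITY FROM CUSTOMERS WHERE ID = 1")
city = cur.fetchone()[0]           # 'Unknown'

# CHECK rejects an under-age row; NOT NULL rejects a missing name
violations = 0
for stmt in ("INSERT INTO CUSTOMERS (ID, NAME, AGE) VALUES (2, 'Amit', 15)",
             "INSERT INTO CUSTOMERS (ID, AGE) VALUES (3, 30)"):
    try:
        cur.execute(stmt)
    except sqlite3.IntegrityError:
        violations += 1            # both inserts are rejected
conn.close()
```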
Database normalization
Database normalization is the process of efficiently organizing data in a database. There are two
reasons for this normalization process −
Eliminating redundant data, for example, storing the same data in more than one table.
Ensuring data dependencies make sense, i.e., data is logically stored.
Both these reasons are worthy goals as they reduce the amount of space a database consumes and
ensures that data is logically stored. Normalization consists of a series of guidelines that help
guide you in creating a good database structure.
If a database design is not perfect, it may contain anomalies, which are like a bad dream for any
database administrator. Managing a database with anomalies is next to impossible.
Anomalies
Update anomalies − If data items are scattered and are not linked to each other properly,
then it could lead to strange situations. For example, when we try to update one data item
having its copies scattered over several places, a few instances get updated properly while
a few others are left with old values. Such instances leave the database in an inconsistent
state.
Insertion anomalies − We try to insert data into a record that does not exist at all.
Deletion anomalies − We try to delete a record, but parts of it are left undeleted because the
data is also saved somewhere else.
Normalization
It is a method to remove all these anomalies and bring the database to a consistent state.
Normalization guidelines are divided into normal forms; think of a form as the format or the way
a database structure is laid out. The aim of normal forms is to organize the database structure, so
that it complies with the rules of first normal form, then second normal form and finally the third
normal form.
It is your choice to take it further and go to the fourth normal form, fifth normal form and so on,
but in general, the third normal form is more than enough.
First Normal Form
First Normal Form is defined in the definition of relations (tables) itself. This rule defines that all
the attributes in a relation must have atomic domains. The values in an atomic domain are
indivisible units.
Each attribute must contain only a single value from its pre-defined domain.
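A small illustrative sketch in Python of what bringing a relation into First Normal Form looks like (the relation and its values are hypothetical): a multivalued Courses attribute is split so that every tuple holds a single atomic value.

```python
# A non-1NF relation: the Courses attribute holds multiple values per row
students = [
    ("S01", "Rama", "Maths, Physics"),
    ("S02", "Hari", "Chemistry"),
]

# First Normal Form: one atomic course value per tuple
first_nf = [
    (sid, name, course.strip())
    for sid, name, courses in students
    for course in courses.split(",")
]
# first_nf now holds ('S01', 'Rama', 'Maths'), ('S01', 'Rama', 'Physics'), ...
```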
Second Normal Form
For a relation to be in Second Normal Form, every non-prime attribute should be fully functionally
dependent on the prime key attributes. That is, if X → A holds, then there should not be any proper
subset Y of X for which Y → A also holds.
To remove a partial dependency, we break the relation into two relations so that every non-prime
attribute depends on the whole key.
Third Normal Form
For a relation to be in Third Normal Form, it must be in Second Normal Form, and for every
functional dependency X → A, at least one of the following must hold −
X is a superkey, or
A is a prime attribute.
To bring this relation into third normal form, we break it into two relations, Student_Detail
(Stu_ID, Name, Zip) and ZipCodes (Zip, City). Stu_ID is the super-key in the relation
Student_Detail and Zip is the super-key in the relation ZipCodes. So,
Stu_ID → Zip
and
Zip → City
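The decomposition above can be sketched in plain Python (the sample Stu_ID, Zip and City values are made up): City depends on Zip alone, so it is moved into its own relation, and a join on Zip rebuilds the original.

```python
# A relation with a transitive dependency: Stu_ID -> Zip and Zip -> City
student = [
    ("S01", "Asha", "382010", "Gandhinagar"),
    ("S02", "Ravi", "380001", "Ahmedabad"),
    ("S03", "Mira", "380001", "Ahmedabad"),   # City repeats for every Zip
]

# 3NF decomposition: City depends on Zip alone, so it moves to ZipCodes
student_detail = [(sid, name, z) for sid, name, z, _ in student]
zip_codes = sorted({(z, city) for _, _, z, city in student})

# Reconstructiveness: a join on Zip rebuilds the original relation
city_of = dict(zip_codes)
rebuilt = [(sid, name, z, city_of[z]) for sid, name, z in student_detail]
```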
Relational algebra defines the basic set of operations of relational database model. A sequence
of relational algebra operations forms a relational algebra expression. The result of this
expression represents the result of a database query.
Projection
Selection
Union
Intersection
Minus
Join
Projection
Projection operation displays a subset of fields of a table. This gives a vertical partition of the
table.
π<AttributeList>(<TableName>)
For example, let us consider the following Student database –
STUDENT
For example, to display the names and courses of all students, we use the following expression −
πName,Course(STUDENT)
Selection
Selection operation displays a subset of tuples of a table that satisfies certain conditions. This
gives a horizontal partition of the table.
σ<Conditions>(<TableName>)
For example, in the Student table, if we want to display the details of all students who have
opted for MCA course, we will use the following relational algebra expression −
σCourse="MCA"(STUDENT)
Combination of Projection and Selection Operations
For most queries, we need a combination of projection and selection operations. There are two
ways to write these expressions −
For example, to display names of all female students of the BCA course, as a single expression −
πName(σGender="Female" AND Course="BCA"(STUDENT))
Alternatively, as a sequence of operations using intermediate results −
FemaleBCAStudent ← σGender="Female" AND Course="BCA"(STUDENT)
Result ← πName(FemaleBCAStudent)
Union
If P is the result of an operation and Q is the result of another operation, the union of P and Q
(P ∪ Q) is the set of all tuples that are either in P or in Q or in both, without duplicates.
For example, to display all students who are either in Semester 1 or enrolled in the BCA course −
Sem1Student ← σSemester=1(STUDENT)
BCAStudent ← σCourse="BCA"(STUDENT)
Result ← Sem1Student ∪ BCAStudent
Intersection
If P is the result of an operation and Q is the result of another operation, the intersection of P and Q
(P ∩ Q) is the set of all tuples that are in both P and Q.
EMPLOYEE
PROJECT
To display the names of all cities where a project is located and also an employee resides −
CityEmp←πCity(EMPLOYEE)
CityProject←πCity(PROJECT)
Result←CityEmp∩CityProject
Minus
If P is a result of an operation and Q is a result of another operation, P - Q is the set of all tuples
that are in P and not in Q.
For example, to list all the departments which do not have an ongoing project (projects with status
= ongoing) –
AllDept←πDepartment(EMPLOYEE)
ProjectDept ← πDepartment(σStatus="ongoing"(PROJECT))
Result ← AllDept − ProjectDept
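The operations above map naturally onto Python's built-in set type. The following sketch (with a hypothetical STUDENT relation) implements projection and selection as small helpers and uses |, & and - for union, intersection and minus:

```python
# Relations as sets of tuples; attribute names are mapped to tuple positions
STUDENT = {
    ("Alia", "BCA", "Female", 1),
    ("Bhavin", "MCA", "Male", 1),
    ("Chitra", "BCA", "Female", 3),
}
COLS = {"Name": 0, "Course": 1, "Gender": 2, "Semester": 3}

def project(rel, *attrs):
    """pi<AttributeList>(rel): keep only the named columns (a vertical cut)."""
    idx = [COLS[a] for a in attrs]
    return {tuple(row[i] for i in idx) for row in rel}

def select(rel, pred):
    """sigma<Conditions>(rel): keep only rows satisfying pred (a horizontal cut)."""
    return {row for row in rel if pred(row)}

# piName(sigma Gender="Female" AND Course="BCA" (STUDENT))
female_bca = select(STUDENT, lambda r: r[COLS["Gender"]] == "Female"
                                       and r[COLS["Course"]] == "BCA")
female_names = project(female_bca, "Name")      # {('Alia',), ('Chitra',)}

# Union, intersection and minus are the plain set operators
sem1 = select(STUDENT, lambda r: r[COLS["Semester"]] == 1)
bca = select(STUDENT, lambda r: r[COLS["Course"]] == "BCA")
counts = (len(sem1 | bca), len(sem1 & bca), len(sem1 - bca))   # (3, 1, 1)
```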
Join
Join operation combines related tuples of two different tables (results of queries) into a single
table.
BRANCH
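A join can be sketched as a nested-loop pairing of tuples that agree on the join attribute. The EMPLOYEE and BRANCH rows below are hypothetical:

```python
# Two hypothetical relations that share a department attribute
EMPLOYEE = {("E01", "Arun", "CS"), ("E02", "Bela", "IT")}
BRANCH = {("CS", "Block A"), ("IT", "Block B")}

def natural_join(left, right, lkey, rkey):
    """Pair tuples whose join-attribute values match (a simple nested-loop join)."""
    return {l + tuple(v for i, v in enumerate(r) if i != rkey)
            for l in left for r in right if l[lkey] == r[rkey]}

# Join EMPLOYEE's dept (position 2) with BRANCH's dept (position 0)
joined = sorted(natural_join(EMPLOYEE, BRANCH, lkey=2, rkey=0))
# [('E01', 'Arun', 'CS', 'Block A'), ('E02', 'Bela', 'IT', 'Block B')]
```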
SQL queries are translated into equivalent relational algebra expressions before optimization. A
query is at first decomposed into smaller query blocks. These blocks are translated to equivalent
relational algebra expressions. Optimization includes optimization of each block and then
optimization of the query as a whole.
Examples
Let us consider the following schemas −
EMPLOYEE
PROJECT
WORKS
Example 1
To display the details of all employees who earn a salary less than the average salary, we write
the SQL query −
SELECT * FROM EMPLOYEE
WHERE SALARY < ( SELECT AVG(SALARY) FROM EMPLOYEE );
This query contains one nested sub-query. So, this can be broken down into two blocks.
AvgSal ← ℑAVERAGE(Salary)(EMPLOYEE)
Result ← σSalary<AvgSal(EMPLOYEE)
Example 2
To display the project ID and status of all projects of employee 'Arun Kumar', we write the SQL
query −
SELECT PID, STATUS FROM PROJECT
WHERE PID = ( SELECT PID FROM WORKS WHERE EMPID = ( SELECT EMPID FROM EMPLOYEE
WHERE NAME = 'ARUN KUMAR'));
This query contains two nested sub-queries. Thus, it can be broken down into three blocks, as
follows −
ArunEmpID←πEmpID(σName="ArunKumar"(EMPLOYEE))
ArunPID ← πPID(σEmpID="ArunEmpID"(WORKS))
Result ← πPID,Status(σPID="ArunPID"(PROJECT))
A multi-DBMS can be expressed through the following levels of schemas −
Multi-database View Level − Depicts multiple user views comprising subsets of the
integrated distributed database.
Multi-database Internal Level − Depicts the data distribution across different sites and multi-
database to local data mapping.
Local database Conceptual Level − Depicts local data organization at each site.
Local database Internal Level − Depicts physical data organization at each site.
The distribution design alternatives for the tables in a DDBMS are as follows −
Fully replicated
Partially replicated
Fragmented
Mixed
Fully Replicated
In this design alternative, at each site, one copy of all the database tables is stored. Since, each
site has its own copy of the entire database, queries are very fast requiring negligible
communication cost. On the contrary, the massive redundancy in data incurs huge costs during
update operations. Hence, this is suitable for systems where a large number of queries are
required to be handled while the number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the tables is
done in accordance with the frequency of access. This takes into consideration the fact that the
frequency of accessing the tables varies considerably from site to site. The number of copies of the
tables (or portions) depends on how frequently the access queries execute and the sites that
generate the access queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or partitions,
and each fragment can be stored at different sites. This considers the fact that it seldom
happens that all data stored in a table is required at a given site. Moreover, fragmentation
increases parallelism and provides better disaster recovery. Here, there is only one copy of each
fragment in the system, i.e. no redundant data.
Vertical fragmentation
Horizontal fragmentation
Hybrid fragmentation
In the last chapter, we had introduced different design alternatives. In this chapter, we will study
the strategies that aid in adopting the designs. The strategies can be broadly divided into
replication and fragmentation. However, in most cases, a combination of the two is used.
Replication
Data replication is the process of storing separate copies of the database at two or more sites. It
is a popular fault tolerance technique of distributed databases.
Reduction in Network Load − Since local copies of data are available, query processing can be
done with reduced network usage, particularly during prime hours. Data updating can be done
at non-prime hours.
Quicker Response − Availability of local copies of data ensures quick query processing and
consequently quick response time.
Simpler Transactions − Transactions require fewer joins of tables located at different
sites and minimal coordination across the network. Thus, they become simpler in nature.
Increased Cost and Complexity of Data Updating − Each time a data item is updated, the
update needs to be reflected in all the copies of the data at the different sites. This requires
complex synchronization techniques and protocols.
Undesirable Application – Database coupling − If complex update mechanisms are not used,
removing data inconsistency requires complex co-ordination at application level. This results
in undesirable application – database coupling.
Commonly used replication techniques include −
Snapshot replication
Near-real-time replication
Pull replication
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are
called fragments.
Fragmentation can be of three types −
horizontal,
vertical, and
hybrid (a combination of horizontal and vertical).
Fragmentation should be done in a way so that the original table can be reconstructed from the
fragments whenever required. This requirement is called “reconstructiveness.”
Advantages of Fragmentation
Since data is stored close to the site of usage, efficiency of the database system is increased.
Local query optimization techniques are sufficient for most queries since data is locally available.
Since irrelevant data is not available at the sites, security and privacy of the database system
can be maintained.
Disadvantages of Fragmentation
When data from different fragments is required, the access speed may be very low.
In case of recursive fragmentations, the job of reconstruction will need expensive techniques.
Lack of back-up copies of data in different sites may render the database ineffective in case of
failure of a site.
Vertical fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order
to maintain reconstructiveness, each fragment should contain the primary key field(s) of the
table. Vertical fragmentation can be used to enforce privacy of data.
For example, let us consider that a University database keeps records of all registered students
in a Student table having the following schema.
STUDENT
Now, the fees details are maintained in the accounts section. In this case, the designer will
fragment the database as follows −
CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
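Reconstructiveness of vertical fragments can be checked in a small sketch with Python's sqlite3 module (the column names beyond Regd_No and Fees are assumed): each fragment keeps the primary key, and a join on that key rebuilds the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Base STUDENT table (columns beyond Regd_No and Fees are assumed)
cur.execute("CREATE TABLE STUDENT (Regd_No TEXT PRIMARY KEY, Name TEXT, Fees REAL)")
cur.execute("INSERT INTO STUDENT VALUES ('R1', 'Asha', 4500.0)")
cur.execute("INSERT INTO STUDENT VALUES ('R2', 'Ravi', 5200.0)")

# Vertical fragments: each one keeps the primary key Regd_No
cur.execute("CREATE TABLE STD_FEES AS SELECT Regd_No, Fees FROM STUDENT")
cur.execute("CREATE TABLE STD_INFO AS SELECT Regd_No, Name FROM STUDENT")

# Reconstructiveness: joining the fragments on the key rebuilds the table
cur.execute("""SELECT i.Regd_No, i.Name, f.Fees
               FROM STD_INFO i JOIN STD_FEES f ON i.Regd_No = f.Regd_No
               ORDER BY i.Regd_No""")
rebuilt = cur.fetchall()   # [('R1', 'Asha', 4500.0), ('R2', 'Ravi', 5200.0)]
conn.close()
```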
Horizontal fragmentation
Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more
fields. Horizontal fragmentation should also conform to the rule of reconstructiveness. Each
horizontal fragment must have all columns of the original base table.
For example, in the student schema, if the details of all students of Computer Science Course
needs to be maintained at the School of Computer Science, then the designer will horizontally
fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE Course = 'Computer Science';
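Reconstructiveness also holds for horizontal fragments: their UNION rebuilds the base table. A small sketch with Python's sqlite3 module (schema and rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Base STUDENT table with a Course column (schema and rows are illustrative)
cur.execute("CREATE TABLE STUDENT (Regd_No TEXT, Name TEXT, Course TEXT)")
cur.executemany("INSERT INTO STUDENT VALUES (?, ?, ?)",
                [("R1", "Asha", "Computer Science"),
                 ("R2", "Ravi", "Electronics")])

# Horizontal fragments: each keeps all columns but only a subset of rows
cur.execute("""CREATE TABLE COMP_STD AS
               SELECT * FROM STUDENT WHERE Course = 'Computer Science'""")
cur.execute("""CREATE TABLE OTHER_STD AS
               SELECT * FROM STUDENT WHERE Course <> 'Computer Science'""")

# Reconstructiveness: the UNION of the fragments rebuilds the base table
cur.execute("SELECT * FROM COMP_STD UNION SELECT * FROM OTHER_STD ORDER BY Regd_No")
rebuilt = cur.fetchall()
conn.close()
```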
Hybrid fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is
used. It can be generated in one of the following two ways −
At first, generate a set of horizontal fragments; then generate vertical fragments from one or
more of the horizontal fragments.
At first, generate a set of vertical fragments; then generate horizontal fragments from one or
more of the vertical fragments.
Transparency
Distribution transparency is the property of distributed databases by the virtue of which the
internal details of the distribution are hidden from the users. The DDBMS designer may choose
to fragment tables, replicate the fragments and store them at different sites. However, since
users are oblivious of these details, they find the distributed database easy to use like any
centralized database.
Location transparency
Fragmentation transparency
Replication transparency
Location transparency
Location transparency ensures that the user can query any table(s) or fragment(s) of a table
as if they were stored locally in the user’s site. The end user should be completely oblivious
of the fact that the table or its fragments are stored at a remote site in the distributed database
system. The address of the remote site(s) and the access mechanisms are completely hidden.
In order to incorporate location transparency, the DDBMS should have access to an updated and
accurate data dictionary and DDBMS directory which contains the details of the locations of data.
Fragmentation transparency
Fragmentation transparency enables users to query upon any table as if it were unfragmented.
Thus, it hides the fact that the table the user is querying on is actually a fragment or union of
some fragments. It also conceals the fact that the fragments are located at diverse sites.
This is somewhat similar to SQL views, where the user may not know that they are
using a view of a table instead of the table itself.
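That view analogy can be made concrete with Python's sqlite3 module: a view over the union of two tables (standing in here for fragments at different sites) lets users query one logical table. The names below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Two tables standing in for fragments stored at different sites
cur.execute("CREATE TABLE STUDENT_CS (Regd_No TEXT, Name TEXT)")
cur.execute("CREATE TABLE STUDENT_EC (Regd_No TEXT, Name TEXT)")
cur.execute("INSERT INTO STUDENT_CS VALUES ('R1', 'Asha')")
cur.execute("INSERT INTO STUDENT_EC VALUES ('R2', 'Ravi')")

# A view presents the union as one table, hiding the fragmentation
cur.execute("""CREATE VIEW STUDENT AS
               SELECT * FROM STUDENT_CS UNION SELECT * FROM STUDENT_EC""")

# Users query STUDENT as if it were a single, unfragmented table
cur.execute("SELECT Name FROM STUDENT ORDER BY Name")
names = [row[0] for row in cur.fetchall()]   # ['Asha', 'Ravi']
conn.close()
```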
Replication transparency
Replication transparency ensures that replication of databases is hidden from the users. It
enables users to query upon a table as if only a single copy of the table exists.