Database Systems the Complete Book C1
Databases today are essential to every business. Whenever you visit a major
Web site — Google, Yahoo!, Amazon.com, or thousands of smaller sites that
provide information — there is a database behind the scenes serving up the
information you request. Corporations maintain all their important records in
databases. Databases are likewise found at the core of many scientific investi-
gations. They represent the data gathered by astronomers, by investigators of
safely. These systems are among the most complex types of software available.
1. Allow users to create new databases and specify their schemas (logical
2. Give users the ability to query the data (a “query” is database lingo for
a question about the data) and modify the data, using an appropriate
language, often called a query language or data-manipulation language.
3. Support the storage of very large amounts of data — many terabytes or
more — over a long period of time, allowing efficient access to the data
for queries and database modifications.
4. Enable durability, the recovery of the database in the face of failures,
errors of many kinds, or intentional misuse.
5. Control access to data from many users at once, without allowing unexpected
interactions among users (called isolation) and without allowing actions on
the data to be performed partially but not completely (called atomicity).
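As a rough illustration of items (1) and (2), the sketch below uses Python's standard sqlite3 module as a stand-in for a DBMS. The accounts table, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Item (1): define a schema with a data-definition statement.
conn.execute(
    "CREATE TABLE accounts (acct_no INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
)

# Item (2): modify the data with data-manipulation statements...
conn.executemany(
    "INSERT INTO accounts (acct_no, owner, balance) VALUES (?, ?, ?)",
    [(1, "Ann", 100.0), (2, "Bob", 250.0)],
)
conn.execute("UPDATE accounts SET balance = balance + 50 WHERE acct_no = 1")
conn.commit()

# ...and ask a question about the data with a query.
rows = conn.execute(
    "SELECT owner, balance FROM accounts WHERE balance >= 150 ORDER BY owner"
).fetchall()
print(rows)
```

The query names the rows wanted by a condition on their values; it says nothing about where or how those rows are stored.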
1.1 Early Database Management Systems
The first commercial database management systems appeared in the late 1960’s.
These systems evolved from file systems, which provide some of item (3) above;
file systems store data over a long period of time, and they allow the storage of
large amounts of data. However, file systems do not generally guarantee that
data cannot be lost if it is not backed up, and they don’t support efficient access
to data items whose location in a particular file is not known.
Further, file systems do not directly support item (2), a query language for
the data in files. Their support for (1) — a schema for the data — is limited to
the creation of directory structures for files. Item (4) is not always supported
by file systems; you can lose data that has not been backed up. Finally, file
systems do not satisfy (5). While they allow concurrent access to files by several
users or processes, a file system generally will not prevent situations such as
two users modifying the same file at about the same time, so the changes made
by one user fail to appear in the file.
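The lost change described above can be reproduced with an ordinary file. The following is a minimal sketch in which the two "users" are simulated sequentially so that the interleaving is deterministic; the file name and the counter it holds are hypothetical.

```python
import os
import tempfile


def read_counter(path):
    """Read the integer stored in the file."""
    with open(path) as f:
        return int(f.read())


def write_counter(path, value):
    """Overwrite the file with a new integer."""
    with open(path, "w") as f:
        f.write(str(value))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "seats_sold.txt")
    write_counter(path, 10)

    # Both users read the same starting value before either one writes.
    seen_by_user_a = read_counter(path)
    seen_by_user_b = read_counter(path)

    # Each user records one sale based on the value it read.
    write_counter(path, seen_by_user_a + 1)
    write_counter(path, seen_by_user_b + 1)  # silently overwrites A's change

    final_value = read_counter(path)

print(final_value)  # 11, although two sales were recorded
```

The file system performed every read and write correctly; what it lacks is any notion that the read and the following write of one user belong together.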
The first important applications of DBMS’s were ones where data was com-
posed of many small items, and many queries or modifications were made.
Examples of these applications are:
THE WORLDS OF DATABASE SYSTEMS
Following a famous paper written by Ted Codd in 1970,² database systems
changed significantly. Codd proposed that database systems should present
the user with a view of data organized as tables called relations. Behind the
scenes, there might be a complex data structure that allowed rapid response to
a variety of queries. But, unlike the programmers for earlier database systems,
the programmer of a relational system would not be concerned with the storage
structure. Queries could be expressed in a very high-level language, which
greatly increased the efficiency of database programmers. SQL (“Structured
Query Language”) is the most important query language based on the relational
model.
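To make the contrast concrete, here is a small sketch that answers the same question twice: once by walking a hand-built storage structure, as a programmer for an earlier system would, and once with a SQL query over a relation. The nested dictionary, the takes table, and the sample data are hypothetical, and sqlite3 merely stands in for a relational system.

```python
import sqlite3

# A hand-built storage structure: the programmer must know it is keyed by student.
grades_by_student = {
    "Lee": [("CS101", "A"), ("MA201", "B")],
    "Kim": [("CS101", "B")],
}

# Navigational style: the code walks the structure step by step.
navigational = sorted(
    student
    for student, records in grades_by_student.items()
    for course, _grade in records
    if course == "CS101"
)

# Relational style: the same data as a table, queried declaratively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE takes (student TEXT, course TEXT, grade TEXT)")
conn.executemany(
    "INSERT INTO takes VALUES (?, ?, ?)",
    [(s, c, g) for s, records in grades_by_student.items() for c, g in records],
)
declarative = [
    row[0]
    for row in conn.execute(
        "SELECT student FROM takes WHERE course = 'CS101' ORDER BY student"
    )
]

print(navigational, declarative)
```

If the dictionary were reorganized to be keyed by course, the navigational code would have to be rewritten, while the SQL query would stay the same.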
By 1990, relational database systems were the norm. Yet the database field
continues to evolve, and new issues and approaches to the management of data
surface regularly. Object-oriented features have infiltrated the relational model.
Some of the largest databases are organized rather differently from those using
computers. The size was necessary, because to store a gigabyte of data required
a large computer system. Today, hundreds of gigabytes fit on a single disk,
Another important trend is the use of documents, often tagged using XML
(eXtensible Markup Language). Large collections of small documents can
1 CODASYL Data Base Task Group April 1971 Report, ACM, New York.
2 Codd, E. F., “A relational model for large shared data banks,” Comm. ACM, 13:6,
serve as a database, and the methods of querying and manipulating them are
different from those used in relational systems.
1. Google holds petabytes of data gleaned from its crawl of the Web. This
data is not held in a traditional DBMS, but in specialized structures
optimized for search-engine queries.
2. Satellites send down petabytes of information for storage in specialized
systems.
3. A picture is actually worth way more than a thousand words. You can
store 1000 words in five or six thousand bytes. Storing a picture typi-
cally takes much more space. Repositories such as Flickr store millions
of pictures and support search of those pictures. Even a database like
Amazon’s has millions of pictures of products to serve.
4. And if still pictures consume space, movies consume much more. An hour
puters to store and distribute data of various kinds. Although each node
in the network may only store a few hundred gigabytes, together the
database they embody is enormous.
many divisions. Each division may have built its own database of products
or employee records independently of other divisions. Perhaps some of these
divisions used to be independent companies, which naturally had their own way
of doing things. These divisions may use different DBMS’s and different struc-
tures for information. They may use different terms to mean the same thing or
the same term to mean different things. To make matters worse, the existence
of legacy applications using each of these databases makes it almost impossible
to scrap them, ever.
distributed among them. One popular approach is the creation of data ware-
houses, where information from many legacy databases is copied periodically,
with the appropriate translation, to a central database. Another approach is
the implementation of a mediator, or “middleware,” whose function is to sup-
port an integrated model of the data of the various databases, while translating
between this model and the actual models used by each database.
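The two approaches can be sketched in a few lines. In this illustration, which assumes two hypothetical divisions that record employees under different field names and units, a mediator translates each source into one integrated model on demand, while a warehouse stores a translated copy. All record layouts and function names here are invented for the example.

```python
# Two hypothetical divisional databases with different terms and units.
division_a = [{"emp_id": 1, "name": "Ann", "salary_usd": 50000}]
division_b = [{"staff_no": "B-7", "full_name": "Bob", "pay_cents": 4200000}]


def from_division_a(record):
    """Translate a division A record into the integrated model."""
    return {
        "employee": f"A:{record['emp_id']}",
        "name": record["name"],
        "salary_usd": record["salary_usd"],
    }


def from_division_b(record):
    """Translate a division B record into the integrated model."""
    return {
        "employee": f"B:{record['staff_no']}",
        "name": record["full_name"],
        "salary_usd": record["pay_cents"] // 100,  # cents to whole dollars
    }


def mediated_employees():
    """Mediator: translate each source at query time; nothing is copied."""
    for record in division_a:
        yield from_division_a(record)
    for record in division_b:
        yield from_division_b(record)


def build_warehouse():
    """Warehouse: copy the translated records into a central store."""
    return list(mediated_employees())


warehouse = build_warehouse()
high_paid = [e["name"] for e in mediated_employees() if e["salary_usd"] > 45000]
print(high_paid)
```

The warehouse copy answers queries without touching the sources but grows stale until the next periodic copy; the mediator always sees current data but pays the translation cost on every query.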
In Fig. 1 we see an outline of a complete DBMS. Single boxes represent system
components, while double boxes represent in-memory data structures. The solid
lines indicate control and data flow, while dashed lines indicate data flow only.
Since the diagram is complicated, we shall consider the details in several stages.
First, at the top, we suggest that there are two distinct sources of commands
to the DBMS:
1. Conventional users and application programs that ask for data or modify
data.
2. A database administrator: a person or persons responsible for the
structure or schema of the database.
beginning at the upper right side of Fig. 1. For example, the database admin-
istrator, or DBA, for a university registrar’s database might decide that there
should be a table or relation with columns for a student, a course the student
has taken, and a grade for that student in that course. The DBA might also
decide that the only allowable grades are A, B, C, D, and F. This structure
and constraint information is all part of the schema of the database. It is
shown in Fig. 1 as entered by the DBA, who needs special authority to exe-
cute schema-altering commands, since these can have profound effects on the
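The registrar example can be written down as a schema declaration. The sketch below uses SQLite through Python's sqlite3 module as a stand-in DBMS; the table name takes and its column names are assumptions, while the list of allowable grades comes from the example itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The DBA's decision: one relation with a student, a course, and a grade,
# where the only allowable grades are A, B, C, D, and F.
conn.execute(
    """
    CREATE TABLE takes (
        student TEXT NOT NULL,
        course  TEXT NOT NULL,
        grade   TEXT NOT NULL CHECK (grade IN ('A', 'B', 'C', 'D', 'F'))
    )
    """
)

conn.execute("INSERT INTO takes VALUES ('Lee', 'CS101', 'A')")

# A grade outside the allowed set violates the constraint and is refused.
try:
    conn.execute("INSERT INTO takes VALUES ('Kim', 'CS101', 'E')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM takes").fetchone()[0]
print(rejected, row_count)
```

Because the constraint lives in the schema, every application that modifies the table is held to it, not only the ones that remember to check.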
[Figure 1: outline of a complete DBMS. Components shown, from top to bottom:
user/application and database administrator; query compiler, transaction
manager, and DDL compiler; execution engine, logging and recovery, and
concurrency control; buffer manager and buffers; storage manager; storage.
Flow labels include queries and updates, transaction commands, DDL commands,
query plan, metadata, statistics, and read/write pages.]

The great majority of interactions with the DBMS follow the path on the left
side of Fig. 1. A user or an application program initiates some action, using
the data-manipulation language (DML). This command does not affect the
schema of the database, but may affect the content of the database (if the
action is a modification command) or will extract data from the database (if the
action is a query). DML statements are handled by two separate subsystems,
as follows.
data files quickly.
The requests for data are passed to the buffer manager. The buffer man-
ager’s task is to bring appropriate portions of the data from secondary storage
(disk) where it is kept permanently, to the main-memory buffers. Normally, the
page or “disk block” is the unit of transfer between buffers and disk.
The buffer manager communicates with a storage manager to get data from
disk. The storage manager might involve operating-system commands, but
more typically, the DBMS issues commands directly to the disk controller.
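A toy version of this arrangement is sketched below. It is read-only, models the storage manager as a plain dictionary from page numbers to contents, and evicts the least recently used page when the buffers are full. The eviction policy and all names are assumptions made for the sketch, not something the text prescribes.

```python
from collections import OrderedDict


class BufferPool:
    """A fixed number of page-sized buffers in front of a slower store."""

    def __init__(self, disk, capacity):
        self.disk = disk              # stand-in for the storage manager
        self.capacity = capacity      # number of buffers available
        self.buffers = OrderedDict()  # page_id -> contents, least recent first
        self.disk_reads = 0

    def get_page(self, page_id):
        """Return a page, reading it from disk only if it is not buffered."""
        if page_id in self.buffers:
            self.buffers.move_to_end(page_id)  # mark as most recently used
            return self.buffers[page_id]
        if len(self.buffers) >= self.capacity:
            self.buffers.popitem(last=False)   # evict least recently used page
        self.disk_reads += 1
        self.buffers[page_id] = self.disk[page_id]
        return self.buffers[page_id]


disk = {n: f"contents of page {n}" for n in range(4)}
pool = BufferPool(disk, capacity=2)
for page_id in [0, 1, 0, 2, 0, 1]:
    pool.get_page(page_id)

print(pool.disk_reads, list(pool.buffers))
```

Six page requests cost only four disk reads here, because page 0 is found in a buffer twice. A real buffer manager must also track modified pages and write them back, which this sketch omits.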
Transaction Processing
Queries and other DML actions are grouped into transactions, which are units
that must be executed atomically and in isolation from one another. Any query
or modification action can be a transaction by itself. In addition, the execu-
tion of transactions must be durable, meaning that the effect of any completed
transaction must be preserved even if the system fails in some way right after
completion of the transaction. We divide the transaction processor into two
major parts:
perform any useful operation on data, that data must be in main memory. It
is the job of the storage manager to control the placement of data on disk and
its movement between disk and main memory.
purposes, DBMS’s normally control storage on the disk directly, at least under
some circumstances. The storage manager keeps track of the location of files
on the disk and obtains the block or blocks containing a file on request from
the buffer manager.
The buffer manager is responsible for partitioning the available main mem-
ory into buffers, which are page-sized regions into which disk blocks can be
transferred. Thus, all DBMS components that need information from the disk
will interact with the buffers and the buffer manager, either directly or through
the execution engine. The kinds of information that various components may
need include:
2. Metadata: the database schema that describes the structure of, and con-
straints on, the database.
3. Log Records: information about recent changes to the database; these
support durability of the database.
4. Statistics: information gathered and stored by the DBMS about data
properties such as the sizes of, and values in, various relations or other
components of the database.
the database to some consistent state. The log manager initially writes
the log in buffers and negotiates with the buffer manager to make sure that
buffers are written to disk (where data can survive a crash) at appropriate
times.
• “I” stands for “isolation,” the fact that each transaction must appear
to be executed as if no other transaction is executing at the same
time.
• “D” stands for “durability,” the condition that the effect on the
database of a transaction must never be lost, once the transaction
has completed.
The remaining letter, “C,” stands for “consistency.” That is, all databases
have consistency constraints, or expectations about relationships among
data elements (e.g., account balances may not be negative after a trans-
action finishes). Transactions are expected to preserve the consistency of
the database.
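The account-balance example can be tried directly. The sketch below, which again uses sqlite3 as a stand-in DBMS with a hypothetical accounts table, declares the consistency constraint in the schema and runs each transfer as one transaction, so a transfer that would leave a negative balance is undone completely rather than half applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("ann", 100), ("bob", 20)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = transfer(conn, "ann", "bob", 70)   # leaves ann with 30
second = transfer(conn, "ann", "bob", 70)  # would leave ann with -40
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(first, second, balances)
```

In the second transfer the credit to bob succeeds before the debit from ann fails, so the rollback is what keeps bob from receiving money that was never withdrawn.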
locks that the scheduler grants, they can get into a situation where none
can proceed because each needs something another transaction has. The
transaction manager has the responsibility to intervene and cancel (“roll-
back” or “abort”) one or more transactions to let the others proceed.
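One common way to notice such a situation, offered here as an assumption rather than as the method of any particular system, is to record which transaction waits for which and to look for a cycle in that waits-for graph. The transaction names and the rule for picking a victim below are hypothetical.

```python
def find_deadlock(waits_for):
    """Return the transactions on some cycle of the waits-for graph, or None."""
    visiting, done = set(), set()

    def visit(txn, path):
        visiting.add(txn)
        path.append(txn)
        for other in waits_for.get(txn, ()):
            if other in visiting:                # reached a txn already on the path
                return path[path.index(other):]
            if other not in done:
                cycle = visit(other, path)
                if cycle:
                    return cycle
        visiting.discard(txn)
        done.add(txn)
        path.pop()
        return None

    for txn in waits_for:
        if txn not in done:
            cycle = visit(txn, [])
            if cycle:
                return cycle
    return None


def abort(waits_for, victim):
    """Drop the victim and every edge pointing at it (its locks are released)."""
    return {
        txn: [other for other in others if other != victim]
        for txn, others in waits_for.items()
        if txn != victim
    }


# T1 waits for T2, T2 for T3, and T3 for T1; T4 merely waits for T1.
waits_for = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"], "T4": ["T1"]}
cycle = find_deadlock(waits_for)
victim = max(cycle)  # arbitrary policy chosen for the sketch
after_abort = find_deadlock(abort(waits_for, victim))
print(cycle, victim, after_abort)
```

Note that T4 is blocked but is not part of the cycle, so cancelling it would not help; only a transaction on the cycle is a useful victim.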
1. The query compiler, which translates the query into an internal form
called a query plan. The latter is a sequence of operations to be performed
on the data. Often the operations in a query plan are implementations
of “relational algebra” operations. The query compiler consists of three
major units:
(a) A query parser, which builds a tree structure from the textual form
of the query.
(b) A query preprocessor, which performs semantic checks on the query
(e.g., making sure all relations mentioned by the query actually
exist), and performs some tree transformations to turn the parse
tree into a tree of algebraic operators representing the initial query
plan.
(c) A query optimizer, which transforms the initial query plan into the
best available sequence of operations on the actual data.
The query compiler uses metadata and statistics about the data to decide
which sequence of operations is likely to be the fastest. For example, the
existence of an index, which is a specialized data structure that facilitates
access to data, given values for one or more components of that data, can
make one plan much faster than another.
2. The execution engine, which has the responsibility for executing each of
the steps in the chosen query plan. The execution engine interacts with
accessing data that is locked, and with the log manager to make sure that
all database changes are properly logged.
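The effect of an index on the chosen plan can be observed in a small experiment. The sketch below asks SQLite, used here only as a convenient stand-in, to report its plan for the same query before and after an index is created. The takes table, the index name idx_takes_student, and the generated rows are hypothetical, and the wording of the reported plan differs between SQLite versions, so the code checks only whether the index is mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE takes (student TEXT, course TEXT, grade TEXT)")
conn.executemany(
    "INSERT INTO takes VALUES (?, ?, ?)",
    [(f"s{i}", f"c{i % 10}", "A") for i in range(1000)],
)

query = "SELECT course, grade FROM takes WHERE student = 's42'"


def plan_for(conn, sql):
    """Return the plan steps SQLite reports for sql (fourth column of each row)."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan_for(conn, query)
result_before = conn.execute(query).fetchall()

# The query text does not change; only the available access paths do.
conn.execute("CREATE INDEX idx_takes_student ON takes (student)")

plan_after = plan_for(conn, query)
result_after = conn.execute(query).fetchall()

uses_index_before = any("idx_takes_student" in step for step in plan_before)
uses_index_after = any("idx_takes_student" in step for step in plan_after)
print(uses_index_before, uses_index_after, result_before == result_after)
```

The answer is identical in both runs; what changes is the sequence of operations the query compiler selects to produce it.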
3 References
Today, on-line searchable bibliographies cover essentially all recent papers con-
cerning database systems. Thus, we shall not try to be exhaustive in our cita-
tions, but rather shall mention only the papers of historical importance and
major secondary sources or useful surveys. A searchable index of database
research papers was constructed by Michael Ley [5], and has recently been
expanded to include references from many fields. Alf-Christian Achilles main-
tains a searchable directory of many indexes relevant to the database field [3].
While many prototype implementations of database systems contributed to
the technology of the field, two of the most widely known are the System R
project at IBM Almaden Research Center [4] and the INGRES project at Berke-
ley [7]. Each was an early relational system and helped establish this type of
system as the dominant database technology. Many of the research papers that
shaped the database field are found in [6].
The 2003 “Lowell report” [1] is the most recent in a series of reports on
database-system research and directions. It also has references to earlier reports
of this type.
You can find more about the theory of database systems than is covered
here from [2] and [8].
1. S. Abiteboul et al., “The Lowell database research self-assessment,” Comm.
ACM 48:5 (2005), pp. 111–118.
http://research.microsoft.com/~gray/lowell/LowellDatabaseResearchSelfAssessment.htm
3. http://liinwww.ira.uka.de/bibliography/Database
5. http://www.informatik.uni-trier.de/~ley/db/index.html . A mirror
site is found at http://www.acm.org/sigmod/dblp/db/index.html .
8. J. D. Ullman, Principles of Database and Knowledge-Base Systems,
Volumes I and II, Computer Science Press, New York, 1988, 1989.