
CS6302-DATABASE MANAGEMENT SYSTEMS UNIT IV

TRENDS IN DATABASE TECHNOLOGY


Overview of Physical Storage Media – Magnetic Disks – RAID – Tertiary storage – File Organization –
Organization of Records in Files – Indexing and Hashing –Ordered Indices – B+ tree Index Files – B tree Index
Files – Static Hashing – Dynamic Hashing - Introduction to Distributed Databases- Client server technology-
Multidimensional and Parallel databases- Spatial and multimedia databases- Mobile and web databases- Data
Warehouse-Mining- Data marts.

4.1 Overview of Physical Storage Media:


Physical Storage Media:
Several types of data storage exist in most computer systems. These storage media are classified
by the speed with which data can be accessed, by cost per unit of data to buy the medium, and by the
medium’s reliability. Among the media typically available are these:
Primary Storage: Fastest media but volatile (cache, main memory).
Secondary Storage: next level in hierarchy, non-volatile, moderately fast access time
 also called on-line storage
 E.g. flash memory, magnetic disks
Tertiary Storage: lowest level in hierarchy, non-volatile, slow access time
 also called off-line storage
 E.g. magnetic tape, optical storage

Figure: 5.1 - Storage Hierarchy


Primary Storage
Cache: the cache is the fastest and most costly form of storage. Cache memory is small. Its use is
managed by the computer system hardware.
Main Memory: main memory is the storage medium that holds the data available to be operated on, and
the general-purpose machine instructions operate on main memory. Although main memory may contain
many megabytes of data, or even gigabytes of data in large server systems, it is generally too small
(or too expensive) for storing the entire database. The contents of main memory are usually lost if a
power failure or system crash occurs.
Secondary Storage

Flash memory: Also known as electrically erasable programmable read-only memory (EEPROM),
flash memory differs from main memory in that data survives power failure. Reading data from
flash memory takes less than 100 nanoseconds (a nanosecond is 1/1000 of a microsecond), which is
roughly as fast as reading data from main memory. However, writing data to flash memory is more
complicated. Data can be written once, which takes about 4 to 10 microseconds, but cannot be
overwritten directly. To overwrite memory that has been written already, we have to erase an entire
bank of memory at once; it is then ready to be written again. A drawback of flash memory is that it
can support only a limited number of erase cycles, ranging from 10,000 to 1 million. Flash memory
has found popularity as a replacement for magnetic disks for storing small volumes of data (5 to 10
megabytes) in low-cost computer systems, such as computer systems that are embedded in other
devices, in hand-held computers, and in other digital electronic devices such as digital cameras.
Magnetic-disk storage: The primary medium for the long-term on-line storage of data is the
magnetic disk. Usually, the entire database is stored on magnetic disk. The system must move the
data from disk to main memory so that they can be accessed. After the system has performed the
designated operations, the data that have been modified must be written to disk. The size of
magnetic disks currently ranges from a few gigabytes to 80 gigabytes. Disk storage survives power
failures and system crashes. Disk-storage devices themselves may sometimes fail and thus destroy
data, but such failures usually occur much less frequently than do system crashes.
Tertiary Storage
Optical Storage: The most popular forms of optical storage are the compact disk (CD), which can
hold about 640 megabytes of data, and the digital video disk (DVD) which can hold 4.7 or 8.5
gigabytes of data per side of the disk (or up to 17 gigabytes on a two-sided disk). Data are stored
optically on a disk, and are read by a laser. The optical disks used in read-only compact disks (CD-
ROM) or read-only digital video disk (DVD-ROM) cannot be written, but are supplied with data
prerecorded. There are "record-once" versions of compact disk (called CD-R) and digital video disk
(called DVD-R), which can be written only once; such disks are also called write-once, read-many
(WORM) disks. There are also "multiple-write" versions of compact disk (called CD-RW) and
digital video disk (DVD-RW and DVD-RAM), which can be written multiple times. Magneto-optical
storage devices, by contrast, use optical means to read magnetically encoded data. Recordable optical
disks are useful for archival storage of data as well as distribution of data.
Tape storage: Tape storage is used primarily for backup and archival data. Although magnetic tape
is much cheaper than disks, access to data is much slower, because the tape must be accessed
sequentially from the beginning. For this reason, tape storage is referred to as sequential-access
storage. In contrast, disk storage is referred to as direct-access storage because it is possible to read
data from any location on disk. Tapes have a high capacity (40-gigabyte to 300-gigabyte tapes are
currently available), and can be removed from the tape drive, so they are well suited to cheap
archival storage. Tape jukeboxes are used to hold exceptionally large collections of data, such as
remote-sensing data from satellites, which could include as much as hundreds of terabytes.
The various storage media can be organized in a hierarchy, as shown in the Figure: 5.1, according
to their speed and their cost. The higher levels are expensive, but are fast. As we move down the
hierarchy, the cost per bit decreases, whereas the access time increases.

Storage Access
The computer's ability to store and retrieve data is at the core of almost every business, whether in
inventory management, word processing, or management decision-making. The effectiveness of a computer
system in information processing depends greatly on how it stores and retrieves data. Secondary
storage devices can be categorized into two types:
 Direct access storage
 Sequential access storage
Direct Access Storage
Direct storage requires an input/output device that is directly connected to the CPU. In other
words, it is said to be on-line, making the stored data available to the CPU at all times. Such storage
devices are called Direct Access Storage Devices (DASD). An example of direct access media is the hard
disk of the computer. Other examples are magnetic disks, optical disks, zip disks, etc.
4.2 Magnetic Disk:
One of the most popular and important secondary storage medium today is the magnetic disk.
Disk technology permits direct and immediate access to data. The computer can directly access a specific
record or piece of data on the disk instead of reading through the records one by one (as in the case of
Sequential Storage Access Devices). It is because of this reason that the disks are often called Direct
Access Storage Devices (DASDs). Magnetic disks offer several advantages over magnetic tapes.
Individual records can be accessed much faster because they have a precise disk address. The main
drawback of the disk technology is the need for backup. In disk technology there is only one copy of the
information, because the old information is overwritten when it is changed. So the information on the
magnetic disk should be backed up onto magnetic tapes or floppy disks. There are many kinds of magnetic
disks: solid (hard) disks and flexible (floppy) disks.
Hard Disks:
Magnetic hard disks are thin steel platters with an iron oxide coating. Several disks may be
mounted together on a vertical shaft on which they rotate at high speeds. Electromagnetic read/write
heads are mounted on access arms. The heads fly over the rotating disks and read or write data on
concentric circles called tracks. Data are recorded on tracks as tiny magnetized spots forming binary
digits. Each track can store thousands of bytes. In most disk systems, each track contains the same
number of bytes, with the data packed together more closely on the inner tracks. The read/write head
never actually touches the disk but hovers a few thousandths or millionths of an inch above it. A dust
particle or a human hair on the disk surface could cause the head to crash into the disk, an event known as
a head crash. Access time to locate data on hard disks is 10-100 milliseconds, compared to 100-600
milliseconds on floppy disks. Access time is the duration taken to complete a data transfer: from the time
when the computer requests data from the secondary storage device to the time when the transfer is complete.
Access time consists of four factors; a rough combined estimate is sketched after the list below:
Seek time - The time it takes for an access arm to get into position over a particular track.
Head switching - The activation of a particular read/write head over a particular track and the surface.
Rotational delay - The time it takes for the particular record to be positioned under the read/write head.


Data transfer - The time it takes to transfer the data to the primary storage.
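
As a rough illustration (not part of the original notes), the sketch below combines the four factors into an estimated time for one block read. All figures used are hypothetical.

    # Rough estimate of the time to read one block from a hard disk.
    # All numbers below are hypothetical, chosen only for illustration.
    avg_seek_ms = 8.0            # seek time: move the arm to the right track
    head_switch_ms = 0.1         # head switching: activate the read/write head
    rotational_delay_ms = 4.17   # rotational delay: half a turn at 7200 RPM
    transfer_rate_mb_s = 100.0   # sustained transfer rate
    block_kb = 4.0               # size of the block to transfer

    transfer_ms = block_kb / 1024 / transfer_rate_mb_s * 1000
    access_ms = avg_seek_ms + head_switch_ms + rotational_delay_ms + transfer_ms
    print(f"Estimated access time: {access_ms:.2f} ms")   # about 12.3 ms
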

Figure: 5.2 - Magnetic Hard Disk Mechanism


Optical Disk
Optical disks are a storage medium from which data is read and to which it is written by lasers.
Optical disks can store much more data (up to 6 gigabytes) than magnetic media such as floppies and hard
disks. There are three basic types of optical disks:
CD-ROM - Like audio CDs, CD-ROMs come with data already encoded onto them. The data is
permanent and can be read any number of times, but CD-ROMs cannot be modified.
CD-R - This term stands for CD-Recordable. With a CD-R disk drive, you can write data onto a CD-R
disk, but only once. After that, the CD-R disk behaves just like a CD-ROM.
CD-RW- CD-RW disk is the abbreviation for CD-ReWritable disk and this is a type of CD disk that
enables you to write onto it in multiple sessions. One of the problems with CD-R disks is that we can
write to them only once. With CD-RW drives and disks, you can treat the optical disk just like a floppy or
hard disk, writing data onto it multiple times.
DVD- DVD is the abbreviation for digital versatile disk or digital video disk and is a type of optical disk
technology similar to the CD-ROM. A DVD holds a minimum of 4.7 GB of data, enough for a full-length
movie. DVDs are commonly used as a medium for digital representation of movies and other multimedia
presentations that combine sound with graphics.
DVD-R- DVD-R stands for DVD-Recordable and is a recordable DVD format similar to CD-R. A DVD-
R can record data only once and then the data becomes permanent on the disk. The disk can not be
recorded onto a second time.
DVD-RW - DVD-RW stands for DVD-Rewritable, a rerecordable DVD format similar to CD-RW. The data
on a DVD-RW disk can be erased and recorded over numerous times without damaging the medium.
Magneto-Optical Drives:
This is a type of disk drive that combines magnetic disk technologies with CD-ROM technologies.
Both read and write can be performed in these drives and also they are removable. The storage capacity of
a magneto-optical disk can be more than 200 megabytes.
Sequential Access Storage
Sequential access storage is off-line. The data in this kind of storage is not accessible to the CPU
until it has been loaded onto an input device. An example of sequential access storage is the magnetic
tape.

Magnetic Tape
Magnetic tape is a magnetically coated strip of plastic on which data can be encoded. Tapes for
computers are similar to the tapes used to store music. Storing data on tapes is considerably cheaper than
storing data on disks. Tapes also have large storage capacities, ranging from a few hundred kilobytes to
several gigabytes. Accessing data on tapes however is much slower than accessing data on disks. Tapes
are sequential access media, which means that to get to a particular point on the tape, the tape must go
through all the preceding points.
Magnetic tapes consist of magnetically coated plastic tape wound on reels. The devices on
which such tapes are run are called magnetic tape drives or magnetic tape units. These devices look
similar to a refrigerator in size and shape and consist of two reels: a supply reel and a take-up reel. They also
have a read/write head and an erase head. The read/write head is an electromagnet that reads magnetized
areas, which represent data on the tape, converts them into electrical signals and sends them to the CPU.
It writes (or records) data onto the tape from the CPU. If there is previous data on the tape, it also erases
that data using the erase head, as it writes the new data.
Data is represented on the tape by invisible magnetized spots. The presence or absence of a spot
corresponds to a 1 bit or a 0 bit. The most common way of organizing data is a 9-track code. The tape is
divided into nine tracks or channels that run through the entire length of the tape. Each row across the 9
tracks represents a single EBCDIC or ASCII character plus a parity bit.
When data is written onto a tape, it is usually divided into logical records. Since the tapes need
some time for deceleration, it is not possible to stop a tape exactly where we want. So some room is left
between records as stopping space. This space also allows time for the tape to accelerate as it starts.
These spaces are called Inter-record gaps (IRG).
4.3 RAID: Redundant Arrays of Independent Disks:
RAID provides disk organization techniques that manage a large number of disks, providing a
view of a single disk. RAID aims to achieve:
 high reliability by storing data redundantly, so that data can be recovered even if a disk fails and
 high capacity and high speed by using multiple disks in parallel.
Improvement of reliability via Redundancy:
Let us first consider reliability. The chance that some disk out of a set of N disks will fail is much
higher than the chance that a specific single disk will fail. Suppose that the mean time to failure of a disk
is 100,000 hours, or slightly over 11 years. Then, the mean time to failure (MTTF) of some disk in an
array of 100 disks will be 100,000 / 100 = 1000 hours, or around 42 days, which is not long at all.
If only one copy of the data is stored, then each disk failure will result in loss of a significant
amount of data. Such a high rate of data loss is unacceptable. The solution to the problem of reliability is
to introduce redundancy; that is, to store extra information that can be used in the event of failure of a
disk to rebuild the lost information.
The simplest (but most expensive) approach to introducing redundancy is to duplicate every disk.
This technique is called mirroring (or, sometimes shadowing). A logical disk then consists of two
physical disks, and every write is carried out on both disks. If one of the disks fails, the data can be
read from the other. Data will be lost only if the second disk fails before the first failed disk is repaired.


The mean time to failure (where failure is the loss of data) of a mirrored disk depends on the mean time to
failure of the individual disks, as well as on the mean time to repair, which is the time it takes (on an
average) to replace a failed disk and to restore the data on it. Power failures and natural disasters such as
earthquakes and fires may result in damage to both disks at the same time. As disks age, the probability of
failure increases, increasing the chance that a second disk will fail within the time the first disk is being
repaired. In spite of all these considerations, however, mirrored disks offer much higher reliability than
do single-disk systems.
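
For intuition, a commonly quoted back-of-the-envelope estimate (not derived in these notes) puts the mean time to data loss of a mirrored pair at roughly MTTF^2 / (2 x MTTR), assuming failures are independent. The repair time used below is an assumed value.

    # Rough mean-time-to-data-loss estimate for a mirrored pair, assuming
    # independent failures (this ignores power failures and disasters that can
    # damage both copies at once, as noted above).
    mttf_hours = 100_000    # mean time to failure of one disk (from the text)
    mttr_hours = 10         # assumed mean time to repair a failed disk

    mttdl_hours = mttf_hours ** 2 / (2 * mttr_hours)
    print(f"{mttdl_hours:,.0f} hours, about {mttdl_hours / 8760:,.0f} years")
    # -> 500,000,000 hours, roughly 57,000 years
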
Improvement in Performance via Parallelism
Consider the benefit of parallel access to multiple disks. With disk mirroring, the rate at which
read requests can be handled is doubled, since read requests can be sent to either disk. The transfer rate of
each read is the same as in a single-disk system but the number of reads per unit time has doubled.
With multiple disks, the transfer rate can be improved as well by striping data across multiple
disks. In its simplest form, data striping consists of splitting the bits of each byte across multiple
disks. Such striping is called bit-level striping. For example, if there is an array of eight disks, then bit i
of each byte is written to disk i. The array of eight disks can be treated as a single disk with sectors that
are eight times the normal size, and, more important, that has eight times the transfer rate.
In such an organization, every disk participates in every access (read or write), so the number of
accesses that can be processed per second is about the same as on a single disk, but each access can read
eight times as much data in the same time as on a single disk.
Block-level striping stripes blocks across multiple disks. It treats the array of disks as a single
large disk, and it gives blocks logical numbers. It is assumed that the block numbers start from 0.
With an array of n disks, block-level striping assigns logical block i to the ⌊i/n⌋th physical
block of disk (i mod n) + 1. For example, with 8 disks, logical block 0 is stored in physical block 0
of disk 1, while logical block 11 is stored in physical block 1 of disk 4.
While reading a large file, block-level striping fetches n blocks at a time in parallel from n disks,
giving a high data transfer rate for large reads. When a single block is read the data transfer rate is the
same as on one disk, but the remaining n - 1 disks are free to perform other actions.
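
A minimal sketch of this mapping (not from the notes; disks are numbered 1 to n as in the text):

    # Block-level striping across n disks:
    # logical block i -> physical block floor(i / n) of disk (i mod n) + 1
    def locate_block(i, n):
        disk = (i % n) + 1          # disks are numbered 1..n, as in the text
        physical_block = i // n     # physical block number on that disk
        return disk, physical_block

    n = 8
    print(locate_block(0, n))    # -> (1, 0): disk 1, physical block 0
    print(locate_block(11, n))   # -> (4, 1): disk 4, physical block 1
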
RAID Levels
Mirroring provides high reliability, but it is expensive. Striping provides high data transfer rates,
but does not improve reliability. Various alternative schemes aim to provide redundancy at lower cost by
combining disk striping with "parity" bits. These schemes have different cost-performance trade-offs. The
schemes are classified into RAID levels.
RAID level 0 refers to disk arrays with striping at the level of blocks, but without any
redundancy. The below Figure: 5.3(a) shows an array of size 4.

Figure: 5.3 – (a) Raid Level 0 and (b) Raid Level 1



RAID level 1 refers to disk mirroring with block striping. The above Figure: 5.3(b) shows a mirrored
organization that holds four disks worth of data.
RAID level 2, known as memory-style error-correcting-code (ECC) organization, employs parity bits.
Memory systems have long used parity bits for error detection and correction.
Usually each byte in a memory system may have a parity bit associated with it that records
whether the number of bits in the byte that are set to 1 is even (parity = 0) or odd (parity = 1). If one of
the bits in the byte gets damaged (either a 1 becomes a 0, or a 0 becomes a 1), the parity of the byte
changes and thus will not match the stored parity. Similarly, if the stored parity bit gets damaged, it will
not match the computed parity. Thus, all 1-bit errors will be detected by the memory system. Error-
correcting schemes store 2 or more extra bits, and can reconstruct the data if a single bit gets damaged.
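
As a small illustration (not from the notes), the even-parity bit of a byte can be computed by counting its 1 bits:

    # Even-parity bit for one byte: 0 if the number of 1 bits is even, 1 if odd.
    def parity_bit(byte):
        return bin(byte).count("1") % 2

    print(parity_bit(0b10110100))   # four 1 bits -> parity 0 (even)
    print(parity_bit(0b10110101))   # five 1 bits -> parity 1 (odd)
    # Flipping any single bit of the byte flips the parity, so a 1-bit error
    # no longer matches the stored parity and is detected.
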
The idea of error-correcting codes can be used directly in disk arrays by striping bytes across
disks. For example, the first bit of each byte could be stored in disk 1, the second bit in disk 2, and so on
until the eighth bit is stored in disk 8, and the error-correction bits are stored in further disks.
The below Figure: 5.4(c) shows the level 2 scheme. The disks labeled P store the error
correction bits. If one of the disks fails, the remaining bits of the byte and the associated error-correction
bits can be read from other disks, and can be used to reconstruct the damaged data. RAID level 2
requires only three disks of overhead for four disks of data, unlike RAID level 1, which requires four disks
of overhead.

Figure: 5.4 – (c) Raid Level 2 and (d) Raid Level 3


RAID level 3, bit-interleaved parity organization, improves on level 2 by exploiting the fact that disk
controllers can detect whether a sector has been read correctly, so a single parity bit can be used for error
correction as well as for detection. The idea is as follows. If one of the sectors gets damaged, the disk
controller knows exactly which sector it is, and, for each bit in the sector, the system can figure out
whether it is a 1 or a 0 by computing the parity of the corresponding bits from sectors in the other
disks. If the parity of the remaining bits is equal to the stored parity, the missing bit is 0. Otherwise, it is
1. The Figure: 5.4(d) shows the RAID level 3.
RAID level 3 is as good as level 2, but is less expensive in the number of extra disks (it has only a
one-disk overhead), so level 2 is not used in practice.
RAID level 3 has two benefits over level 1. First, it needs only one parity disk for several regular disks,
whereas level 1 needs one mirror disk for every disk, and thus reduces the storage overhead. Second, since
reads and writes of a byte are spread out over several disks with N-way striping of data, the transfer rate
for reading or writing a single block is N times as fast as with level 1.
RAID level 4, block-interleaved parity organization, uses block level striping, like RAID 0, and in
addition keeps a parity block on a separate disk for corresponding blocks from N other disks. This
scheme is shown pictorially in the Figure: 5.5(e). If one of the disks fails, the parity block can be used
with the corresponding blocks from the other disks to restore the blocks of the failed disk.
A block read accesses only one disk, allowing other requests to be processed by the other disks.
Thus, the data-transfer rate for each access is slower, but multiple read accesses can proceed in parallel,
leading to a higher overall I/O rate. The transfer rates for large reads are high, since all the disks can be
read in parallel; large writes also have high transfer rates, since the data and parity can be written in
parallel.
Small independent writes, on the other hand, cannot be performed in parallel. A write of a block
has to access the disk on which the block is stored, as well as the parity disk, since the parity block has to
be updated. Moreover, both the old value of the parity block and the old value of the block being written
have to be read for the new parity to be computed.
Thus, a single write requires four disk accesses: two to read the two old blocks, and two to write the two
blocks.
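
Because the parity block is simply the bitwise exclusive-or of the corresponding data blocks, the new parity for such a small write can be computed incrementally. The sketch below uses hypothetical byte-sized "blocks".

    # Small-write parity update in RAID 4/5 (sketch, byte-sized "blocks"):
    # new_parity = old_parity XOR old_data XOR new_data
    def update_parity(old_parity, old_data, new_data):
        return old_parity ^ old_data ^ new_data

    data = [0b1010, 0b0110, 0b1100]        # three data "blocks"
    parity = data[0] ^ data[1] ^ data[2]   # parity "block"

    new_value = 0b0001                     # overwrite data block 1
    parity = update_parity(parity, data[1], new_value)
    data[1] = new_value

    assert parity == data[0] ^ data[1] ^ data[2]   # same as recomputing fully
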

Figure: 5.5(e) - Raid Level 4


RAID level 5, block-interleaved distributed parity, improves on level 4 by partitioning data and
parity among all N + 1 disks, instead of storing data in N disks and parity in one disk. In level 5, all
disks can participate in satisfying read requests, unlike RAID level 4, where the parity disk cannot
participate, so level 5 increases the total number of requests that can be met in a given amount of time.
For each set of N logical blocks, one of the disks stores the parity, and the other N disks store the blocks.
The Figure: 5.6(f) shows the setup. The P's are distributed across all the disks. The following table
indicates how the first 20 blocks, numbered 0 to 19, and their parity blocks are laid out. The pattern
shown gets repeated on further blocks.

Figure: 5.6(f) - Raid Level 5


Figure: 5.7 – Distribution of Parity Block PK
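
One common way of rotating the parity blocks Pk across the disks is sketched below for N + 1 = 5 disks. This is an assumed convention shown only for illustration; the exact assignment depicted in Figure: 5.7 may differ.

    # RAID 5 with 5 disks: for block set k (logical blocks 4k .. 4k+3), the
    # parity block Pk goes on disk (k mod 5) + 1 and the four data blocks
    # occupy the remaining disks in order. (One common convention; assumed.)
    def raid5_layout(k, num_disks=5):
        parity_disk = (k % num_disks) + 1
        data_disks = [d for d in range(1, num_disks + 1) if d != parity_disk]
        return parity_disk, data_disks

    for k in range(5):
        p, data = raid5_layout(k)
        print(f"P{k} on disk {p}; blocks {4*k}..{4*k + 3} on disks {data}")
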


RAID level 6, the P + Q redundancy scheme, is much like RAID level 5, but it stores extra redundant
information to guard against multiple disk failures. Instead of using parity, level 6 uses error-correcting
codes such as the Reed-Solomon codes. In the scheme shown in the Figure: 5.8(g), 2 bits of redundant data
are stored for every 4 bits of data (unlike 1 parity bit in level 5), and the system can tolerate two disk failures.

Figure: 5.8(g) - Raid Level 6


Choice of RAID level
 The factors to be taken into account when choosing a RAID level are
 Monetary cost of extra disk storage requirements
 Performance requirements in terms of number of I/O operations
 Performance when a disk has failed
 Performance during rebuild (that is, while the data in a failed disk is being rebuilt on a new disk)
Rebuilding is easiest for RAID level 1, since data can be copied from another disk; for the other
levels, we need to access all the other disks in the array to rebuild data of a failed disk. RAID level 0 is
used in high-performance applications where data safety is not critical. Since RAID levels 2 and 4 are
subsumed by RAID levels 3 and 5, the choice of RAID levels is restricted to the remaining levels. Bit
striping (level 3) is rarely used since block striping (level 5) gives as good data transfer rates for large
transfers, while using fewer disks for small transfers. The choice between RAID level 1 and level 5 is
harder to make. RAID level 1 is popular for applications such as storage of log files in a database system,
since it offers the best write performance. RAID level 5 has a lower storage overhead than level 1, but has
a higher time overhead for writes. For applications where data are read frequently, and written rarely,
level 5 is the preferred choice.
Hardware Issues
Another issue in the choice of RAID implementations is at the level of hardware. RAID can be
implemented with no change at the hardware level, using only software modification. Such RAID

implementations are called software RAID. Systems with special hardware support are called hardware
RAID systems. The hardware RAID provides the following advantages.
Hardware RAID implementations can use nonvolatile RAM to record writes that need to be
executed; in case of power failure before a write is completed, when the power is back, it retrieves
information about incomplete writes from nonvolatile RAM and then completes the writes. Without such
hardware support, extra work needs to be done to detect blocks that may have been partially written
before.
Some hardware RAID implementations permit hot swapping; that is, faulty disks can be removed
and replaced by new ones without turning power off. Hot swapping reduces the mean time to repair, since
replacement of a disk does not have to wait until a time when the system can be shut down.
The power supply, or the disk controller, or even the system interconnection in a RAID system
could become a single point of failure that could stop functioning of the RAID system. To avoid this
possibility, good RAID implementations have multiple redundant power supplies (with battery backups
so they continue to function even if power fails). Such RAID systems have multiple disk controllers, and
multiple interconnections to connect them to the computer system. Thus, failure of any single component
will not stop the functioning of the RAID system.
4.4 File Organization:
A file is organized as a sequence of records. These records are mapped onto disk blocks. The
record organization methods are the following:
Fixed Length Records:
Consider a file of account records for the bank database. Each record of the file is defined as:
    type deposit = record
        account-number: char(10);
        branch-name: char(22);
        balance: real;
    end
If it is assumed that each character occupies 1 byte and that a real occupies 8 bytes, then an account
record is 40 bytes long. A simple approach is to use the first 40 bytes for the first record, the next 40 bytes
for the second record, and so on. The Figure: 5.9 shows how fixed-length records are stored in the file.

Figure: 5.9 – Fixed Length Records
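
As an illustrative sketch (not from the notes), the same 40-byte layout can be expressed with Python's struct module. The field order follows the deposit record above; the file name and sample values are assumptions.

    import struct

    # 10-byte account number, 22-byte branch name, 8-byte real (double) = 40 bytes
    RECORD = struct.Struct("<10s22sd")
    assert RECORD.size == 40

    def pack(account_number, branch_name, balance):
        return RECORD.pack(account_number.encode().ljust(10),
                           branch_name.encode().ljust(22),
                           balance)

    # Record i occupies bytes 40*i .. 40*i + 39 of the file.
    with open("account.dat", "wb") as f:
        f.write(pack("A-102", "Perryridge", 400.0))
        f.write(pack("A-305", "Round Hill", 350.0))
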


However, there are two problems with this simple approach:


 It is difficult to delete a record from this structure. The space occupied by the record to be
deleted must be filled with some other record of the file, or we must have a way of marking
deleted records so that they can be ignored.
 If the block size happens to be greater than 40 bytes, some records will cross block
boundaries. That is, part of the record will be stored in one block and part in another. It would
thus require two block accesses to read or write such a record.
The deletion can be performed in several ways:
The first approach is that when a record is deleted, all the records after it are moved up to fill the
deleted position. Instead of this approach, it might be easier simply to move the final record of the file
into the space occupied by the deleted record. Another approach is to reuse the space of the deleted
record by inserting a new record in that place. This approach avoids the movement of records. Since it is
hard to find the available space, it is desirable to use some additional structure. At the beginning of the file,
a certain number of bytes are allocated as a file header. The header will contain a variety of information
about the file. In addition to all this information, it maintains the address of the first record whose contents
have been deleted. This first record is used to store the address of the second available record, and so on.
These stored addresses are called pointers, since they point to the location of a record. The deleted records
thus form a linked list, which is often referred to as a free list. The Figure: 5.12 shows the file with the free
list after records 1, 4, and 6 have been deleted.

Figure: 5.12 - File of Figure: 5.9, with free list after deletion of records 1,4 and 6.
On insertion of a new record, we can use the space of the record pointed to by the header. The header
pointer is changed to point to the next available record after the insertion. If no space is available, the
insertion is done at the end of the file.
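
A minimal in-memory sketch of this free-list scheme follows (an illustration under assumed names, not the notes' own code). One simple policy is used here, in which the most recently deleted slot is reused first; the chaining order shown in Figure: 5.12 may differ.

    # Fixed-length record file with a free list threaded through deleted slots.
    # Each deleted slot stores the index of the next free slot (or None); the
    # file header keeps the index of the first free slot.
    class FixedLengthFile:
        def __init__(self):
            self.records = []      # slot i holds a record or a free-list link
            self.free_head = None  # header field: first deleted slot, if any

        def delete(self, i):
            self.records[i] = ("FREE", self.free_head)   # chain onto free list
            self.free_head = i

        def insert(self, record):
            if self.free_head is None:        # no free slot: append at the end
                self.records.append(record)
                return len(self.records) - 1
            i = self.free_head                # reuse the slot the header points to
            self.free_head = self.records[i][1]
            self.records[i] = record
            return i

    f = FixedLengthFile()
    for r in ["rec0", "rec1", "rec2", "rec3", "rec4", "rec5", "rec6"]:
        f.insert(r)
    for i in (1, 4, 6):
        f.delete(i)
    print(f.insert("new record"))   # reuses slot 6, then 4, then 1 on later inserts
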
Variable Length Records:
Variable-length records arise in database systems in several ways:
o Storage of multiple record types in a file.
o Record types that allow variable lengths for one or more fields.
o Record types that allow repeating fields (used in some older data models).
Consider a different representation of the account information, in which one variable length record
is used for each branch name and for all account information in that branch. The format of the record is:
    type account-list = record
        branch-name: char(22);
        account-info: array [1 .. ∞] of
            record
                account-number: char(10);
                balance: real;
            end
    end
The account-info is defined as an array with an arbitrary number of elements. That is, the type
definition does not limit the number of elements in the array, although any actual record will have a
specific number of elements in its array. There is no limit on how large a record can be (up to, of course,
the size of the disk storage).
Byte String Representation:
A simple method for implementing variable-length records is to attach a special end-of-record
symbol to the end of each record. Then each record is stored as a string of consecutive bytes. The
Figure: 5.13 shows such an organization to represent the account file as variable-length records.

Figure: 5.13 - Byte-String Representation of Variable-Length Records


An alternative version of the byte-string representation stores the record length at the beginning of
each record, instead of using end-of-record symbols.
The byte-string representation as described has some disadvantages:
 It is not easy to reuse space occupied formerly by a deleted record.
 There is no space, in general, for records to grow longer. If a variable-length record becomes
longer, it must be moved; movement is costly if pointers to the record are stored elsewhere in the
database (e.g., in indices, or in other records), since the pointers must be located and updated.
Thus, the basic byte-string representation described here is not usually used for implementing variable-
length records. However, a modified form of the byte-string representation, called the slotted-page
structure, is commonly used for organizing records within a single block. The slotted-page structure
appears in the Figure: 5.14

Figure: 5.14 - Slotted Page Structure



There is a header at the beginning of each block, containing the following information:
 The number of record entries in the header.
 The end of free space in the block
 An array whose entries contain the location and size of each record
The actual records are allocated contiguously in the block, starting from the end of the block. The free
space in the block is contiguous, between the final entry in the header array, and the first record. If
a record is inserted, space is allocated for it at the end of free space, and an entry containing its size and
location is added to the header.
If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its size is set
to -1, for example). Further, the records in the block before the deleted record are moved, so that the free
space created by the deletion gets occupied and all free space is again between the final entry in the
header array and the first record. The end-of-free-space pointer in the header is appropriately updated as
well. Records can be grown or shrunk by similar techniques, as long as there is space in the block.
The slotted-page structure requires that there be no pointers that point directly to records. Instead,
pointers must point to the entry in the header that contains the actual location of the record. This level of
indirection allows records to be moved to prevent fragmentation of space inside a block, while supporting
indirect pointers to the record.
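
The sketch below (hypothetical, simplified to a single in-memory block) mirrors that bookkeeping: a slot array of (offset, size) entries, an end-of-free-space marker, and records allocated from the end of the block. Compaction after deletion is only noted in a comment.

    # Simplified slotted page: records are placed from the end of the block
    # backwards; the header keeps a slot array of (offset, size) entries and an
    # end-of-free-space pointer.
    class SlottedPage:
        def __init__(self, block_size=4096):
            self.block = bytearray(block_size)
            self.slots = []                  # header array: (offset, size)
            self.free_end = block_size       # end of free space in the block

        def insert(self, record: bytes):
            offset = self.free_end - len(record)
            self.block[offset:self.free_end] = record
            self.free_end = offset
            self.slots.append((offset, len(record)))
            return len(self.slots) - 1       # slot number; outside pointers use this

        def delete(self, slot):
            offset, _ = self.slots[slot]
            self.slots[slot] = (offset, -1)  # size -1 marks the entry as deleted
            # A full implementation would also move the remaining records so that
            # the free space stays contiguous, as described above.

    page = SlottedPage()
    s = page.insert(b"Perryridge|A-102|400")
    print(page.slots[s])                     # (offset near end of block, length)
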
Fixed-length Representation
Another way to implement variable-length records efficiently in a file system is to use one or
more fixed-length records to represent one variable-length record.
There are two ways of doing this:
 Reserved space: If there is a maximum record length that is never exceeded, then fixed-length
records of that length are used. Unused space (for records shorter than the maximum) is filled
with a special null, or end-of-record, symbol.
 List representation: variable-length records can be represented by lists of fixed length records,
chained together by pointers.
If the reserved-space method is applied to the account example, a maximum record length needs to be
selected. The Figure: 5.15 shows how the account file would be represented if a maximum of
three accounts per branch is allowed.

Figure: 5.15 - Fixed-Length Representation


A record in this file is of the account-list type, but with the array containing exactly three
elements. Those branches with fewer than three accounts (for example, Round Hill) have records with
null fields. A null symbol is used to represent this situation in the figure. The reserved-space method is

useful when most of the records have a length close to the maximum. Otherwise, a significant amount of
space may be wasted.
In the bank example, some branches may have many more accounts than others. This situation
leads us to consider the linked list method. To represent the file by the linked list method, a pointer field
should be added. The resulting structure appears in the Figure: 5.16.

Figure: 5.16 - Pointer Method


A disadvantage to the structure of the above Figure: 5.16 is that space is wasted in all records
except the first in a chain. The first record needs to have the branch-name value, but subsequent records
do not. Nevertheless, a field for branch-name must be included in all records. This wasted space is significant.
To deal with this problem, two kinds of blocks are allowed in a file:
 Anchor block, which contains the first record of a chain
 Overflow block, which contains records other than those that are the first record of a chain
Thus, all records within a block have the same length, even though not all records in the file have
the same length. The Figure: 5.17 shows this file structure.

Figure: 5.17 - Pointer Method-Using Anchor block Overflow block


4.5 Organization of Records in Files:
An instance of a relation is a set of records. Given a set of records, the next question is how to
organize them in a file. Several of the possible ways of organizing records in files are:


 Heap file organization (heap files): In this simplest and most basic type of organization, records
are placed in the file in the order in which they are inserted, so new records are inserted at the end
of the file. Such an organization is called a heap or pile file.
 Sequential file organization (sorted files): Records are stored in sequential order according to the
value of a "search key" of each record.
 Hashing file organization: A hash function is computed on some attribute of each record. The
result of the hash function specifies in which block of the file the record should be placed.
 Clustering file organization: Generally, a separate file is used to store the records of each
relation. However, in a clustering file organization, records of several different relations are stored
in the same file; further, related records of the different relations are stored on the same block, so
that one I/O operation fetches related records from all the relations. For example, records of the
two relations can be considered to be related if they would match in a join of the two relations.
Heap file organization (heap files or unordered files):
In this simplest and most basic type of organization, records are placed in the file in the order
in which they are inserted, so new records are inserted at the end of the file. Such an organization is
called a heap or pile file. This organization is often used with additional access paths, such as the
secondary indexes. It is also used to collect and store records for future use.
Inserting a new record is efficient. The last disk block of the file is copied into a buffer. The new
record is added, and the block is then rewritten back to the disk. The address of the last file block is
kept in the file header. However, searching for a record using any search condition involves a linear
search through the file by block, which is an expensive procedure. If only one record satisfies the
search condition, then, on the average, a program will read into memory and search half the file blocks
before it finds the record. For a file of b blocks, this requires searching b/2 blocks on average. If no records
satisfy the search condition, the program must read and search all b blocks in the file.
To delete a record, a program must first find its block, copy the block into the buffer, and finally
rewrite the block back to the disk. This leaves unused space in the disk block. Deleting a large number
of records in this way results in wasted storage space. Another technique used for record deletion is to
have an extra byte or bit, called a deletion marker, stored with each record. A record is deleted by
setting the deletion marker to a certain value. A different value of the marker indicates a valid (not
deleted) record. Search programs consider only valid records in a block when conducting their search.
Both of these deletion techniques require periodic reorganization of the file to reclaim the unused space
of deleted records. During reorganization, the file blocks are accessed consecutively, and records are
packed by removing deleted records. After such reorganization, the blocks are filled to capacity once
more. Another possibility is to use the space when inserting records although this requires extra
bookkeeping to keep track of empty locations.
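
A brief sketch (assumed, with simplified in-memory "blocks") of a heap file that uses a deletion marker, showing why an equality search must scan linearly:

    # Heap (unordered) file: records appended in arrival order, logical deletion
    # via a deletion marker, and linear search over the blocks.
    class HeapFile:
        def __init__(self, records_per_block=4):
            self.records_per_block = records_per_block
            self.entries = []                # each entry: [deleted_flag, record]

        def insert(self, record):
            self.entries.append([False, record])   # always added at the end

        def delete(self, matches):
            for entry in self.entries:
                if not entry[0] and matches(entry[1]):
                    entry[0] = True          # set the marker; space not reclaimed

        def search(self, matches):
            blocks_read = 0
            for i, (deleted, record) in enumerate(self.entries):
                if i % self.records_per_block == 0:
                    blocks_read += 1         # another block brought into memory
                if not deleted and matches(record):
                    return record, blocks_read
            return None, blocks_read         # all b blocks read on a miss

    f = HeapFile()
    for n in range(20):
        f.insert({"account": f"A-{n}", "balance": 100 * n})
    print(f.search(lambda r: r["account"] == "A-7"))   # found after 2 block reads

A successful search reads about b/2 blocks on average, and an unsuccessful one reads all b blocks, matching the figures above.
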
Sequential File Organization (sorted files or ordered files):
A sequential file organization is designed for efficient processing of records in sorted order based
on some search key. A search key is any attribute or set of attributes. It need not be the primary key, or
even a super key. To permit fast retrieval of records in search-key order, the records are chained together
by pointers. The pointer in each record points to the next record in search-key order. Furthermore, to

minimize the number of block accesses in sequential file processing, the records are stored physically in
search-key order, or as close to search-key order as possible.
Figure: 5.18 shows a sequential file of account records taken from the banking example. In that
example, the records are stored in search-key order, using branch-name as the search key.

Figure: 5.18 - Sequential File for account Records


The sequential file organization allows records to be read in sorted order; that can be useful for
display purposes, as well as for certain query-processing algorithms. It is difficult, however, to maintain
physical sequential order as records are inserted and deleted, since it is costly to move many records
as a result of a single insertion or deletion. Deletion can be managed by using pointer chains.
For insertion, the following rules are applied:
 Locate the record in the file that comes before the record to be inserted in search-key order.
 If there is a free record (that is, space left after a deletion) within the same block as this record,
insert the new record there. Otherwise, insert the new record in an overflow block. In either case,
adjust the pointers so as to chain together the records in search-key order.
The Figure: 5.19 shows the account file after the insertion of the record (North Town, A-888,
800). The structure in the figure allows fast insertion of new records, but forces sequential file-processing
applications to process records in an order that does not match the physical order of the records. If
relatively few records need to be stored in overflow blocks, this approach works well. Eventually,
however, the correspondence between search-key order and physical order may be totally lost, in which
case sequential processing will become much less efficient. At this point, the file should be reorganized
so that it is once again physically in sequential order. Such reorganizations are costly; and must be done
during times when the system load is low. The frequency with which reorganizations are needed depends
on the frequency of insertion of new records. In the extreme case in which insertions rarely occur, it is
possible always to keep the file in physically sorted order.

Figure: 5.19 - Sequential File Organization after an Insertion



Clustering File Organization:


Many relational-database systems store each relation in a separate file, so that they can take full
advantage of the file system that the operating system provides. This simple approach to relational-
database implementation becomes less satisfactory as the size of the database increases. There are
performance advantages to be gained from careful assignment of records to blocks, and from careful
organization of the blocks themselves. A more complicated file structure may be beneficial, even if the
strategy of storing each relation in a separate file is used.
However, many large-scale database systems do not rely directly on the underlying operating
system for file management. Instead, one large operating-system file is allocated to the database system.
The database system stores all relations in this one file, and manages the file itself. To see the advantage
of storing many relations in one file, consider the following SQL query for the bank database:
select account_number, customer_name, customer_street, customer_city
from depositor, customer
where depositor.customer_name = customer.customer_name
This query computes a join of the depositor and customer relations. Thus, for each tuple of
depositor, the system must locate the customer tuples with the same value for customer-name. Regardless
of how these records are located, however, they need to be transferred from disk into main memory. In
the worst case, each record will reside on a different block, forcing us to do one block read for each
record required by the query. As an example, consider the depositor and customer relations of given
below.

Figure: 5.20 - Depositor Relation

Figure: 5.21 - Customer Relation


The Figure: 5.22 shows a file structure designed for efficient execution of queries involving a join of
depositor and customer.

Figure: 5.22 - Multiple clustering file structure



The depositor tuples for each customer-name are stored near the customer tuple for the
corresponding customer name. This structure mixes together tuples of two relations, but allows for
efficient processing of the join. When a tuple of the customer relation is read, the entire block containing
that tuple is copied from disk into main memory. Since the corresponding depositor tuples are stored on
the disk near the customer tuple, they are also copied. If a customer has so many accounts that the depositor
records do not fit in one block, the remaining records appear on nearby blocks.
A clustering file organization is a file organization, such as that illustrated in the Figure: 5.22
that stores related records of two or more relations in each block. Such a file organization allows us to
read records that would satisfy the join condition by using one block read.
The use of clustering has enhanced processing of a particular join (depositor joined with customer), but it
results in slower processing of other types of query. For example,
Select * from customer
requires more block accesses than it did in the scheme under which each relation is stored in a separate
file. Instead of several customer records appearing in one block, each record is located in a distinct
block. Indeed, simply finding all the customer records is not possible without some additional structure.
To locate all tuples of the customer relation in the structure of the Figure: 5.22, it is necessary to chain
together all records of that relation using pointers, as in the Figure:
5.23. The usage of clustering depends on the types of query that the database designer believes to be most
frequent. Careful use of clustering can produce significant performance gains in query processing.

Figure: 5.23 – Multiple Clustering File Structure with Pointer Chains

4.6 Indexing:
Database system indices play the same role as book indices or card catalogs in the libraries. For
example, to retrieve an account record given the account number, the database system would look up an
index to find on which disk block the corresponding record resides, and then fetch the disk block, to get
the account record.
There are two basic kinds of indices:
 Ordered indices: Based on a sorted ordering of the values.
 Hash indices: Based on a uniform distribution of values across a range of buckets. The bucket to
which a value is assigned is determined by a function called a hash function.


Several techniques exist for both ordered indexing and hashing. No one technique is the best.
Rather, each technique is best suited to particular database applications. Each technique must be
evaluated on the basis of these factors:
 Access types: The types of access that are supported efficiently. Access types can include finding
records with a specified attribute value and finding records, whose attribute values fall in a
specified range.
 Access time: The time it takes to find a particular data item, or set of items using the technique in
question.
 Insertion time: The time it takes to insert a new data item. This value includes the time it takes to
find the correct place to insert the new data item, as well as the time it takes to update the index
structure.
 Deletion time: The time it takes to delete a data item. This value includes the time it takes to find
the item to be deleted, as well as the time it takes to update the index structure.
 Space overhead: The additional space occupied by an index structure.
Applications often want to have more than one index for a file. For example, libraries maintained
several card catalogs: for author, for subject, and for title.
An attribute or set of attributes used to look up records in a file is called a search key. This
definition of key differs from that used in primary key, candidate key, and super key. If there are several
indices on a file, then there are several search keys.
Ordered Indices
To gain fast random access to records in a file, an index structure is used. Each index structure is
associated with a particular search key. Just like the index of a book or a library catalog an ordered index
stores the values of the search keys in sorted order, and associates with each search key the records that
contain it.
Ordered indices can be categorized as primary index and secondary index.
Primary Index
In this index, it is assumed that all files are ordered sequentially on some search key. Such files,
with a primary index on the search key, are called index-sequential files. They represent one of the oldest
index schemes used in database systems. They are designed for applications that require both sequential
processing of the entire file and random access to individual records.
The Figure: 5.24 shows a sequential file of account records taken from the banking example. In
the example figure, the records are stored in search-key order, with branch-name used as the search key.

Figure: 5.24 – Sequential file for account records.


Dense and Sparse Indices


An index record, or index entry, consists of a search-key value, and pointers to one or more
records with that value as their search-key value. The pointer to a record consists of the identifier of a
disk block and an offset within the disk block to identify the record within the block.
There are two types of ordered indices that can be used:
Dense index: an index record appears for every search-key value in the file. In a dense primary
index, the index record contains the search-key value and a pointer to the first data record with that
search-key value. The rest of the records with the same search key-value would be stored sequentially
after the first record, since, because the index is a primary one, records are sorted on the same search key.
Dense index implementations may store a list of pointers to all records with the same search-key
value; doing so is not essential for primary indices. The Figure: 5.25 shows the dense index for the
account file.

Figure: 5.25 - Dense Index.


Sparse index: An index record appears for only some of the search-key values. To locate a record
we find the index entry with the largest search-key value that is less than or equal to the search key value
for which we are looking. We start at the record pointed to by that index entry, and follow the pointers in
the file until we find the desired record. The Figure: 5.26 shows the sparse index for the account file.

Figure: 5.26 - Sparse Index.


Suppose that we are looking up records for the Perryridge branch. Using the dense index we
follow the pointer directly to the first Perryridge record. We process this record and follow the pointer in
that record to locate the next record in search-key (branch-name) order. We continue processing records
until we encounter a record for a branch other than Perryridge. If we are using the sparse index, we do not
find an index entry for "Perryridge". Since the last entry (in alphabetic order) before "Perryridge" is
"Mianus", we follow that pointer. We then read the account file in sequential order until we find the first
Perryridge record, and begin processing at that point.
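
A minimal sketch of the sparse-index lookup just described (the index is kept as a sorted list of (search-key, block) pairs; branch names other than Mianus and Perryridge are assumed):

    import bisect

    # Sparse primary index: one entry per block, (first search-key in block, block no.)
    sparse_index = [("Brighton", 0), ("Mianus", 1), ("Redwood", 2)]
    keys = [k for k, _ in sparse_index]

    def lookup_block(search_key):
        # Largest indexed key <= search_key; scanning starts from that block.
        pos = bisect.bisect_right(keys, search_key) - 1
        return sparse_index[max(pos, 0)][1]

    print(lookup_block("Perryridge"))   # -> 1: follow the "Mianus" entry, then
                                        #    read sequentially until Perryridge
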
Thus, it is generally faster to locate a record in a dense index; rather than a sparse index. However,
sparse indices have advantages over dense indices in that they require less space and they impose less
maintenance overhead for insertions and deletions.


There is a trade-off that the system designer must make between access time and space overhead.
Although the decision regarding this trade-off depends on the specific application, a good compromise is
to have a sparse index with one index entry per block. The reason this design is a good trade-off is that
the dominant cost in processing a database request is the time that it takes to bring a block from disk into
main memory. Once we have brought in the block, the time to scan the entire block is negligible. Using
this sparse index, we locate the block containing the record that we are seeking. Thus, unless the record is
on an overflow block, we minimize block accesses while keeping the size of the index as small as
possible.
Multi level indices
Even if the sparse index is used, the index itself may become too large for efficient processing. It
is not unreasonable, in practice, to have a file with 100,000 records, with 10 records stored in each block.
If we have one index record per block, the index has 10,000 records. Index records are smaller than data
records, so let us assume that 100 index records fit on a block. Thus, our index occupies 100 blocks. Such
large indices are stored as sequential files on disk.
If an index is sufficiently small to be kept in main memory, the search time to find an entry is low.
However, if the index is so large that it must be kept on disk, a search for an entry requires several disk
block reads. Binary search can be used on the index file to locate an entry, but the search still has a large
cost. If overflow blocks have been used, binary search will not be possible. In that case, a sequential
search is typically used, and that requires b block reads, which will take even longer. Thus, the process of
searching a large index may be costly.
To deal with this problem, we treat the index just as we would treat any other sequential file, and
construct a sparse index on the primary index, as in the Figure: 5.27
To locate a record, we first use binary search on the outer index to find the record for the largest
search-key value less than or equal to the one that we desire. The pointer points to a block of the inner
index. We scan this block until we find the record that has the largest search-key value less than or equal
to the one that we desire. The pointer in this record points to the block of the file that contains the record
for which we are looking.

Figure: 5.27 – Two-level Sparse Index


Using the two levels of indexing, we have read only one index block, rather than the seven we
read with binary search, if we assume that the outer index is already in main memory. If our file is

extremely large, even the outer index may grow too large to fit in main memory. In such a case, we can
create yet another level of index. Indices with two or more levels are called multilevel indices. Searching
for records with a multilevel index requires significantly fewer I/O operations than does searching for
records by binary search.
A typical dictionary is an example of a multilevel index in the non database world. The header of
each page lists the first word alphabetically on that page. Such a book index is a multilevel index: The
words at the top of each page of the book index form a sparse index on the contents of the dictionary
pages.
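As an illustration, the following is a minimal Python sketch of the two-level lookup described above. It assumes the outer index and each inner-index block are in-memory sorted lists of (search-key, pointer) pairs and that each data block is a list of (key, record) pairs; all names are illustrative and the sketch ignores actual disk blocks.

from bisect import bisect_right

def lookup(outer_index, inner_blocks, data_blocks, key):
    # outer_index:  sorted list of (first_key, inner_block_no) pairs
    # inner_blocks: list of inner-index blocks, each a sorted list of
    #               (first_key, data_block_no) pairs
    # data_blocks:  list of data blocks, each a list of (key, record) pairs
    # Find the outer entry with the largest search-key value <= key.
    keys = [k for k, _ in outer_index]
    pos = bisect_right(keys, key) - 1
    if pos < 0:
        return None
    inner = inner_blocks[outer_index[pos][1]]
    # Repeat the same search on the inner-index block it points to.
    keys = [k for k, _ in inner]
    pos = bisect_right(keys, key) - 1
    if pos < 0:
        return None
    # Scan the data block sequentially for the desired record.
    for k, record in data_blocks[inner[pos][1]]:
        if k == key:
            return record
    return None

With the outer index kept in main memory, a lookup reads only one inner-index block and one data block, as described above.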
Index Update
Regardless of what form of index is used, every index must be updated whenever a record is either
inserted into or deleted from the file. These are the algorithms used for updating single level indices.
 Insertion: First, the system performs a lookup using the search-key value that appears in the record to
be inserted. Again, the actions the system takes next depend on whether the index is dense or sparse:
 Dense indices:
the following actions are taken:
 If the index record stores pointers to all records with the same search-key value,
the system adds a pointer to the new record to the index record.
 Otherwise, the index record stores a pointer to only the first record with the
search-key value. The system then places the record being inserted after the
other records with the same search-key values.
 Sparse indices:
We assume that the index stores an entry for each block. If the system creates a
new block, it inserts the first search-key value (in search-key order) appearing in the new
block into the index. On the other hand, if the new record has the least search-key value in
its block, the system updates the index entry pointing to the block; if not, the system makes
no change to the index.
Deletion. To delete a record, the system first looks up the record to be deleted. The actions the
system takes next depend on whether the index is dense or sparse.
 Dense indices:
1. If the deleted record was the only record with its particular search-key value, then
the system deletes the corresponding index record from the index.
2. Otherwise the following actions are taken:
 If the index record stores pointers to all records with the same search-key
value, the system deletes the pointer to the deleted record from the index
record.
 Otherwise, the index record stores a pointer to only the first record with the
search-key value. In this case, if the deleted record was the first record with
the search-key value, the system updates index record to point to the next
record.
 Sparse indices:


1. If the index does not contain an index record with the search-key value of the
deleted record, nothing needs to be done to the index.
2. Otherwise the system takes the following actions: If the deleted record was the only
record with its search key, the system replaces the corresponding index record with
an index record for the next search-key value (in search-key order). If the next
search-key value already has an index entry, the entry is deleted instead of being
replaced. Otherwise, if the index record for the search-key value points to record
being deleted, the system updates the index record to point to the next record with
the same search-key value.
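To make the dense-index rules above concrete, here is a minimal Python sketch that keeps a dense index as a dictionary mapping each search-key value to a list of record pointers; blocks, buckets, and the sparse case are omitted, and all names are purely illustrative.

dense_index = {}            # search-key value -> list of record pointers

def index_insert(key, record_ptr):
    # If an index record for the key exists, add a pointer to the new record;
    # otherwise create a new index entry.
    dense_index.setdefault(key, []).append(record_ptr)

def index_delete(key, record_ptr):
    ptrs = dense_index.get(key)
    if ptrs is None:
        return
    # Delete the pointer to the deleted record.
    if record_ptr in ptrs:
        ptrs.remove(record_ptr)
    # If it was the only record with this search-key value,
    # delete the corresponding index entry as well.
    if not ptrs:
        del dense_index[key]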
Secondary Indices:
Secondary indices must be dense, with an index entry for every search-key value and a pointer to
every record in the file. A primary index may be sparse, storing only some of the search-key values, since
it is always possible to find records with intermediate, search-key values by a sequential access to a part
of the file. If a secondary index stores only some of the search-key values, records with intermediate
search-key values may be anywhere in the file and, in general, we cannot find them without searching the
entire file.
A secondary index on a candidate key looks just like a dense primary index, except that the
records pointed to by successive values in the index are not stored sequentially. In general, however,
secondary indices may have a different structure from primary indices. If the search key of a primary
index is not a candidate key, it suffices if the index points to the first record with a particular value for the
search key, since the other records can be fetched by a sequential scan of the file.
In contrast, if the search key of a secondary index is not a candidate key, it is not enough to point
to just the first record with each search-key value. The remaining records with the same search-key value
could be anywhere in the file, since records are ordered by the search key of the primary index, rather
than by the search key of the secondary index. Therefore, a secondary index must contain pointers to all
the records.
We can use an extra level of indirection to implement secondary indices on search keys that are
not candidate keys. The pointers in such a secondary index do not point directly to the file. Instead, each
points to a bucket that contains pointers to the file. The Figure: 5.28 shows the structure of a secondary
index that uses an extra level of indirection on the account file, on the search key balance.

Figure: 5.28 - Secondary index on account file on noncandidate key balance.



A sequential scan in primary index order is efficient because records in the file are stored
physically in the same order as the index order. However, we cannot (except in rare special cases) store a
file physically ordered both by the search key of the primary index, and the search key of a secondary
index. Because secondary-key order and physical-key order differ, if we attempt to scan the file
sequentially in secondary-key order, the reading of each record is likely to require the reading of a new
block from disk, which is very slow.
The procedures described earlier for insertion and deletion can also be applied to secondary
indices.
4.7 Hashing Techniques:
Hashing is a type of primary file organization that provides very fast access to records under
certain search conditions. This organization is called a hash file.
The idea behind hashing is to provide a function h, called a hash function or randomizing
function, that is applied to the hash field value of a record and yields the address of the disk block in
which the record is stored. A search for the record within the block can be carried out in a main memory
buffer.
The various types of hashing techniques are:
 Internal hashing
 External hashing
 Dynamic hashing
Internal Hashing:
For internal files, hashing is typically implemented as a hash table through the use of an array of
records. If the array index ranges from 0 to M-1, there are M slots whose addresses correspond to the
array indexes. A hash function is chosen that transforms the hash field value into an integer between 0
and M-1. One common hash function is h(K) = K mod M, which returns the remainder of an
integer hash field value K after division by M; this value is then used as the record address.

Non-integer hash field values can be transformed into integers before the mod function is applied.
For character strings, the numeric (ASCII) codes associated with the characters can be used in the
transformation, for example by multiplying those code values.


For a hash field value whose data type is a string of 20 characters, the algorithm below can be used to
calculate the hash address:

Algorithm:

temp <- 1;
for i <- 1 to 20 do
    temp <- temp * code(k[i]) mod M;
hash_address <- temp mod M;
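A runnable Python version of the algorithm above is sketched below; it assumes M buckets and a key of up to 20 characters, with ord() playing the role of the character-code function.

def string_hash(key, M):
    # Multiply the character codes together, reducing modulo M at each
    # step so the intermediate value stays small.
    temp = 1
    for ch in key[:20]:
        temp = (temp * ord(ch)) % M
    return temp % M         # hash address in the range 0 .. M-1

For example, string_hash("Perryridge", 101) returns a bucket number between 0 and 100.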

Other hashing techniques are given below:

Folding involves applying an arithmetic function such as addition or a logical function such as
exclusive or to different portions of the hash field value to calculate the hash address. Another technique
involves picking some digits of the hash field value – for example, the third, fifth and eighth digits to
form the hash address.

The problem with most hash functions is that they do not guarantee that distinct values will
hash to distinct addresses, because the hash field space (the number of possible values a hash field can take) is
usually much larger than the address space (the number of available addresses for records).

A collision occurs when the hash field value of a record that is being inserted hashes to an address
that already contains a different record. In this situation the new record must be inserted in some other
position, since its hash address is occupied. The process of finding another position is called collision
resolution. There are numerous methods for collision resolution as given below:

 Open addressing: proceeding from the occupied position specified by the hash address,
the program checks the subsequent positions in order until an unused (empty) position is
found. The below algorithm may be used for this purpose.
Algorithm:
i <- hash_address(k);
a <- i;
if location i is occupied
then begin
    i <- (i + 1) mod M;
    while (i <> a) and location i is occupied do
        i <- (i + 1) mod M;
    if (i = a) then all positions are full
    else new_hash_address <- i;
end;
(A runnable sketch of this scheme appears after this list.)
 Chaining: For this method, various overflow locations are kept, for extending the array
with a number of overflow positions. In addition a pointer field is added to each record


location. A collision is resolved by placing the new record in an unused overflow location
and setting the pointer of the occupied hash address location to the address of that
overflow location. A linked list of overflow records for each hash address is maintained as
shown in the figure 2.
 Multiple Hashing: The program applies a second hash function if the first results in a
collision, if another collision results, the program uses open addressing or applies a third
hash function and then uses open addressing if necessary.
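The following is a minimal Python sketch of the open-addressing (linear probing) scheme from the list above; the table is a simple in-memory array of M slots, there is no deletion support, and all names are illustrative.

M = 11                         # number of slots (illustrative)
table = [None] * M             # None marks an unused (empty) position

def insert(key, record):
    i = key % M                # hash address h(K) = K mod M
    start = i
    # Check subsequent positions in order until an empty one is found.
    while table[i] is not None:
        i = (i + 1) % M
        if i == start:
            raise RuntimeError("all positions are full")
    table[i] = (key, record)

def search(key):
    i = key % M
    start = i
    while table[i] is not None:
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % M
        if i == start:
            break
    return None                # record not found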

External Hashing for Disk Files:

Hashing for disk files is called external hashing. To suit the characteristics of the disk storage
the target address space is made up of buckets, each of which holds multiple records. A bucket is either
one disk block or a cluster of contiguous blocks. The hash function maps a key into a relative
bucket number, rather than assigning an absolute block address to the bucket. A table is maintained in the
file header, which converts the bucket number into the corresponding disk block address as in the figure
3.

The collision problem is less severe with buckets, because as many records as will fit in a bucket
can hash to the same bucket without causing the problems. However collision occurs if the bucket is filled
and a new record being inserted hashes to the same bucket. This is resolved by using the variation of
chaining, in which a pointer is maintained in each bucket to a linked list of overflow records for the
bucket as shown in the below figure 4.

The pointers in the linked list should be record pointers, which include both a block address and a
relative record position within the block.


Disadvantages: The hashing described above is called static hashing because a fixed number of
buckets M is allocated. This can be a serious drawback for dynamic files. In addition, when using external
hashing, searching for a record given a value of some field other than the hash field is as expensive as in
the case of an unordered file.

Figure 3

Figure 4

Dynamic Hashing or Extendible Hashing:

A major drawback of the static hashing is that the hash address space is fixed. Hence it is difficult
to expand or shrink the file dynamically. The dynamic or extendible hashing is used to remedy the
problems in static hashing.

In extendible hashing, a type of directory (an array of 2^d bucket addresses) is maintained, where
d is called the global depth of the directory. The integer value corresponding to the first (higher-order) d
bits of a hash value is used as an index into the array to determine a directory entry, and the address in that

entry determines the bucket in which the corresponding records are stored. However, there does not have
to be a distinct bucket for each of the 2^d directory locations. Several directory locations with the same first
d' bits of their hash values may contain the same bucket address if all the records that hash to these
locations fit in a single bucket. A local depth d' stored with each bucket specifies the number of bits on
which the bucket contents are based. The figure below shows a directory with global depth = 3.

The value of d can be increased or decreased by one at a time, thus doubling or halving the
number of entries in the directory array. Doubling is needed if a bucket, whose local depth d’ is equal to
the global depth d, overflows. Halving is needed if d>d’ for all the buckets after some deletions occur.
Most record retrievals require two block accesses- one to the directory and the other to the bucket.

To illustrate bucket splitting, suppose that a new record inserted causes overflow in the bucket
whose hash values start with 01 (the third bucket in the figure). The records will be distributed between
two buckets: the first contains all records whose hash values start with 010, and the second all those whose
hash values start with 011. Now the two directory locations 010 and 011 point to the two new distinct buckets;
before the split they pointed to the same bucket. The local depth d' of the two new buckets is 3, which is
one more than the local depth of the old bucket.

The main advantage of extendible hashing is that the performance of the file does not
degrade as the file grows, as opposed to static external hashing, where collisions increase and the
corresponding chaining causes additional accesses. In addition, no space is allocated in extendible
hashing for future growth; additional buckets can be allocated dynamically as needed.
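The following is a small in-memory Python sketch of extendible hashing, under simplifying assumptions: keys are distinct non-negative integers, a bucket holds only two records, and the directory is indexed on the low-order bits of the key rather than the higher-order bits of a hash value as described above.

BUCKET_CAPACITY = 2                      # records per bucket (illustrative)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # d': bits on which the contents are based
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1                      # d
        self.directory = [Bucket(1), Bucket(1)]    # 2**d directory entries

    def _index(self, key):
        # Use the low-order global_depth bits of the key as the directory index.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        # Overflow: split the bucket, doubling the directory if needed.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        # Directory entries whose new distinguishing bit is 1 now point
        # to the new bucket; the others keep pointing to the old one.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = new_bucket
        # Redistribute the old records together with the new one.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys + [key]:
            self.insert(k)               # may trigger further splits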

4.8 B+ tree Index Files:


B+-tree indices are an alternative to indexed-sequential files.
 Disadvantage of indexed-sequential files
o performance degrades as file grows, since many overflow blocks get created.
o Periodic reorganization of entire file is required.


 Advantage of B+-tree index files:


o Automatically reorganizes itself with small, local, changes, in the face of insertions and
deletions.
o Reorganization of entire file is not required to maintain performance.
 (Minor) disadvantage of B+-trees:
o Extra insertion and deletion overhead, space overhead.
 Advantages of B+-trees outweigh disadvantages
o B+-trees are used extensively
A B+-tree is a rooted tree satisfying the following properties:
 All paths from root to leaf are of the same length
 Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
 A leaf node has between ⌈(n–1)/2⌉ and n–1 values
 Special cases:
o If the root is not a leaf, it has at least 2 children.
o If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and
(n–1) values.
B+-Tree Node Structure
 Typical node

Figure: 5.38 - B+-Tree Node Structure


o Ki are the search-key values
o Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records
(for leaf nodes).
 The search-keys in a node are ordered
 K1 < K2 < K3 < . . . < Kn–1


Leaf Nodes in B+-Trees


Properties of a leaf node:
 For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket
of pointers to file records, each record having search-key value Ki. Only need bucket structure if
search-key does not form a primary key.
 If Li, Lj are leaf nodes and i < j, Li’s search-key values are less than Lj’s search-key values
 Pn points to the next leaf node in search-key order

Figure: 5.39 – A Leaf Node for account B+-Tree index(n=3).


Non-Leaf Nodes in B+-Trees
 Non leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m
pointers:
o All the search-keys in the subtree to which P1 points are less than K1
o For 2 ≤ i ≤ n – 1, all the search-keys in the subtree to which Pi points have values greater
than or equal to Ki–1 and less than Ki
o All the search-keys in the subtree to which Pn points have values greater than or equal to
Kn–1

Figure: 5.40 - Non leaf node


Example of a B+-tree

Figure: 5.41 - B+-tree for account file (n = 3)

Figure: 5.42 - B+-tree for account file (n = 5)



 Leaf nodes must have between 2 and 4 values


(⌈(n–1)/2⌉ and n–1, with n = 5).
 Non-leaf nodes other than root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5).
 Root must have at least 2 children.
Queries on B+ Trees:
Find all records with a search-key value of k.
1. Start with the root node
 Examine the node for the smallest search-key value > k.
 If such a value exists, assume it is Ki. Then follow Pi to the child node.
 Otherwise k ≥ Km–1, where there are m pointers in the node. Then follow Pm to the
child node.
2. If the node reached by following the pointer above is not a leaf node, repeat the above
procedure on the node, and follow the corresponding pointer.
3. Eventually reach a leaf node. If for some i, key Ki = k follow pointer Pi to the desired
record or bucket. Else no record with search-key value k exists.
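The lookup procedure above can be sketched in Python as follows, assuming a simple in-memory node structure in which each node stores its search-key values in keys (in ascending order) and its child or record pointers in pointers; the class and field names are illustrative only.

class Node:
    def __init__(self, keys, pointers, is_leaf):
        self.keys = keys          # K1 .. Km-1 in ascending order
        self.pointers = pointers  # P1 .. Pm (children; records for leaves)
        self.is_leaf = is_leaf

def find(root, k):
    node = root
    # Descend from the root to the leaf that could contain k.
    while not node.is_leaf:
        i = 0
        # Find the smallest search-key value greater than k, if any.
        while i < len(node.keys) and node.keys[i] <= k:
            i += 1
        node = node.pointers[i]
    # In the leaf, pointer Pi corresponds to search-key value Ki.
    for i, key in enumerate(node.keys):
        if key == k:
            return node.pointers[i]   # record, or bucket of records
    return None                       # no record with search-key value k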
Updates on B+-Trees: Insertion
 Find the leaf node in which the search-key value would appear
 If the search-key value is already there in the leaf node, record is added to file and if necessary a
pointer is inserted into the bucket.
 If the search-key value is not there, then add the record to the main file and create a bucket if
necessary. Then:
 If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node
 Otherwise, split the node.


Splitting a node:
 take the n (search-key value, pointer) pairs (including the one being inserted) in sorted
order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
 let the new node be p, and let k be the least key value in p. Insert (k,p) in the parent of the
node being split. If the parent is full, split it and propagate the split further up.
The splitting of nodes proceeds upwards till a node that is not full is found. In the worst case the
root node may be split increasing the height of the tree by 1.
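A minimal Python sketch of the leaf-splitting step described above, assuming entries is the sorted list of the n (search-key value, pointer) pairs including the one being inserted; the function name is illustrative.

import math

def split_leaf(entries, n):
    # Place the first ceil(n/2) pairs in the original node and the rest in
    # a new node; return both, plus the key to be inserted into the parent
    # (the least key value in the new node).
    half = math.ceil(n / 2)
    original, new_node = entries[:half], entries[half:]
    return original, new_node, new_node[0][0]

With n = 3 and entries for Brighton, Clearview and Downtown, Brighton and Clearview stay in the original node, Downtown goes to the new node, and "Downtown" is inserted into the parent, as in Figure 5.43.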

Figure 5.43 Result of splitting node containing Brighton and Downtown on inserting Clearview

Insertion Example:

Figure 5.44: B+-Tree before and after insertion of “Clearview”

Updates on B+-Trees: Deletion


1. Find the record to be deleted, and remove it from the main file and from the bucket (if present)
2. Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has
become empty
3. If the node has too few entries due to the removal, and the entries in the node and a sibling fit into
a single node, then
 Insert all the search-key values in the two nodes into a single node (the one on the left),
and delete the other node.


 Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent,
recursively using the above procedure.
4. Otherwise, if the node has too few entries due to the removal, but the entries in the node and a
sibling do not fit into a single node, then
 Redistribute the pointers between the node and a sibling such that both have more than the
minimum number of entries.
 Update the corresponding search-key value in the parent of the node.
5. The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.

Figure 5.55 before and after deleting “Downtown”


4.9 B-Tree Index Files:


 Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates redundant
storage of search keys.
 Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each
search key in a nonleaf node must be included.
 Generalized B-tree leaf node (a) and non-leaf node (b) – the pointers Bi are bucket or file-record
pointers.

Figure: 5.43 - (a) B-tree leaf node ,(b) Non leaf node
B-Tree Index File Example

Figure: 5.44 – B-tree


 Advantages of B-Tree indices:
o May use fewer tree nodes than a corresponding B+-Tree.
o Sometimes possible to find search-key value before reaching leaf node.
 Disadvantages of B-Tree indices:
o Only small fraction of all search-key values are found early
o Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth
than corresponding B+-Tree
o Insertion and deletion more complicated than in B+-Trees
o Implementation is harder than B+-Trees.
 Typically, the advantages of B-trees do not outweigh their disadvantages.


4.10 Distributed Databases:


A distributed database (DDB) is a collection of multiple logically interrelated databases
distributed over a computer network, and a distributed database management system acts as a software
system that manages a distributed database while making the distribution transparent to the user.

Distributed database Architecture:

Types of distributed databases:

Homogenous and Heterogeneous Databases:

In a homogeneous distributed database, all sites have identical database management system
software, are aware of one another, and agree to cooperate in processing users' requests. In such a system,
local sites surrender a portion of their autonomy in terms of their right to change schemas or database
management system software. That software also cooperates with other sites in exchanging information
about transactions, to make transaction processing possible across multiple sites.
[Figure: Homogeneous distributed database – Sites 1 to 5, each running Oracle on Windows, Unix, or Linux, connected by a communications network]

In contrast, in a heterogeneous distributed database, different sites may use different schemas, and
different database management system software. The sites may not be aware of one another, and they
may provide only limited facilities for cooperation in transaction processing.

[Figure: Heterogeneous distributed database – Sites 1 to 5 running relational, object-oriented, hierarchical, and network DBMSs on Unix, Windows, and Linux, connected by a communications network]

Distributed Data Storage

Consider a relation r that is to be stored in the database. There are two approaches to storing this
relation in the distributed database:

 Replication: The system maintains several identical replicas (copies) of the relation, and stores
each replica at a different site. The alternative to replication is to store only one copy of relation r.
 Fragmentation: The system partitions the relation into several fragments, and stores each
fragment at a different site.
Fragmentation and replication can be combined: A relation can be partitioned into several
fragments and there may be several replicas of each fragment.

Data Replication
If relation r is replicated, a copy of relation r is stored in two or more sites. In the most extreme
case, we have full replication, in which a copy is stored in every site in the system.
There are a number of advantages and disadvantages to replication.
 Availability: If one of the sites containing relation r fails, then the relation r can be found in
another site. Thus, the system can continue to process queries involving r, despite the failure of
one site.
 Increased parallelism: In the case where the majority of accesses to the relation r result in only
the reading of the relation, then several sites can process queries involving r in parallel. The more
replicas of r there are, the greater the chance that the needed data will be found in the site where
the transaction is executing. Hence, data replication minimizes movement of data between sites.
 Increased overhead on update: The system must ensure that all replicas of a relation are
consistent; otherwise, erroneous computations may result. Thus, whenever r is updated, the update
must be propagated to all sites containing replicas. The result is increased overhead. For example,
in a banking system, where account information is replicated in various sites, it is necessary to ensure
that the balance in a particular account agrees in all sites.

Data Fragmentation
If relation r is fragmented, r is divided into a number of fragments r1, r2,..., rn. These fragments
contain sufficient information to allow reconstruction of the original relation r. There are two different
schemes for fragmenting a relation:
 Horizontal fragmentation
 Vertical fragmentation
Horizontal fragmentation splits the relation by assigning each tuple of r to one or more
fragments.
Vertical fragmentation splits the relation by decomposing the scheme R of relation r.
These approaches are illustrated by fragmenting the relation account, as below:
Account-schema = (account-number, branch-name, balance)


In horizontal fragmentation, a relation r is partitioned into a number of subsets, r1,r2,…,rn. Each


tuple of relation r must belong to at least one of the fragments, so that the original relation can be
reconstructed, if needed.
As an illustration, the account relation can be divided into several different fragments, each of
which consists of tuples of accounts belonging to a particular branch. If the banking system has only two
branches Hillside and Valley view then there are two different fragments:
account1 = σbranch-name = “Hillside”(account)
account2 = σbranch-name = “Valleyview”(account)

Horizontal fragmentation is usually used to keep tuples at the sites where they are used the most,
to minimize data transfer. In general, a horizontal fragment can be defined as a selection on the global
relation r. That is, we use a predicate Pi to construct fragment ri:
ri = σpi(r)

The reconstruction of the relation r is made by taking the union of all fragments; that is,

r = r1 U r2 U··· U rn.

In vertical fragmentation, the relation schema is decomposed. Vertical fragmentation of r(R) involves the
definition of several subsets of attributes R1, R2, ..., Rn of the schema R so that

R = R1 U R2 U … U Rn

Each fragment ri of r is defined by

ri = ΠRi(r)

The fragmentation should be done in such a way that the relation r can be reconstructed from the
fragments by taking the natural join

r = r1 ⋈ r2 ⋈ r3 ⋈ … ⋈ rn

One way of ensuring that the relation r can be reconstructed is to include the primary-key
attributes of R in each of the Ri. More generally, any super key can be used. It is often convenient to add
a special attribute, called a tuple-id, to the schema R. The tuple-id value of a tuple is a unique value that
distinguishes the tuple from all other tuples. The tuple-id attribute thus serves as a candidate key for the
augmented schema, and is included in each of the Ris. The physical or logical address for a tuple can be
used as a tuple-id, since each tuple has a unique address.
To illustrate vertical fragmentation, consider a university database with a relation employee-info
that stores, for each employee, employee-id, name, designation, and salary. For privacy reasons, this
relation may be fragmented into a relation employee-private-info containing employee-id and salary, and


another relation employee-public-info containing attributes employee-id, name, and designation. These
may be stored at different sites, again for security reasons.
The two types of fragmentation can be applied to a single schema; for instance, the fragments
obtained by horizontally fragmenting a relation can be further partitioned vertically. Fragments can also
be replicated. In general, a fragment can be replicated; replicas of fragments can be fragmented further,
and so on.
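A small Python sketch of the two fragmentation schemes applied to the account relation, with each relation represented as a list of dictionaries; the sample data are illustrative, and the reconstruction by union and by natural join on the key mirrors the algebra above.

account = [
    {"account_number": "A-101", "branch_name": "Hillside",   "balance": 500},
    {"account_number": "A-215", "branch_name": "Valleyview", "balance": 700},
    {"account_number": "A-305", "branch_name": "Hillside",   "balance": 350},
]

# Horizontal fragmentation: a selection on branch-name for each fragment.
account1 = [t for t in account if t["branch_name"] == "Hillside"]
account2 = [t for t in account if t["branch_name"] == "Valleyview"]
reconstructed = account1 + account2          # union of the fragments

# Vertical fragmentation: projections onto attribute subsets that both
# contain the key account_number.
balances = [{"account_number": t["account_number"], "balance": t["balance"]}
            for t in account]
branches = [{"account_number": t["account_number"], "branch_name": t["branch_name"]}
            for t in account]

# Reconstruction by natural join on account_number.
joined = [{**b, **p} for b in branches for p in balances
          if b["account_number"] == p["account_number"]]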

Transparency
The user of a distributed database system should not be required to know either where the data are
physically located or how the data can be accessed at the specific local site. This characteristic, called
data transparency, can take several forms:
 Fragmentation transparency: Users are not required to know how a relation has been
fragmented.
 Replication transparency: Users view each data object as logically unique. The distributed
system may replicate an object to increase either system performance or data availability. Users do
not have to be concerned with what data objects have been replicated, or where replicas have been
placed.
 Location transparency: Users are not required to know the physical location of the data. The
distributed database system should be able to find any data as long as the data identifier is
supplied by the user transaction.
Distributed Transactions
Access to the various data items in a distributed system is usually accomplished through
transactions, which must preserve the ACID properties. There are two types of transaction that need to be
considered. The local transactions are those that access and update data in only one local database; the
global transactions are those that access and update data in several local databases. Ensuring the ACID
properties of the local transactions can be done easily. However, for global transactions, this task is much
more complicated, since several sites may be participating in execution. The failure of one of these sites,
or the failure of a communication link connecting these sites, may result in erroneous computations.
For this purpose, the following system structure is maintained for a distributed database.
System Structure
Each site has its own local transaction manager, whose function is to ensure the ACID properties
of those transactions that execute at that site. The various transaction managers cooperate to execute
global transactions. To understand how such a manager can be implemented, consider an abstract model
of a transaction system, in which each site contains two subsystems:
The transaction manager manages the execution of those transactions (or sub transactions) that
access data stored in a local site. Note that each such transaction may be either a local transaction (that is,
a transaction that executes at only that site) or part of a global transaction (that is, a transaction that
executes at several sites).
The transaction coordinator coordinates the execution of the various transactions (both local and
global) initiated at that site.
The overall structure appears in the below figure:

The structure of a transaction manager is similar in many respects to the structure of a centralized
system. Each transaction manager is responsible for

 Maintaining a log for recovery purposes


 Participating in an appropriate concurrency-control scheme to coordinate the concurrent execution
of the transactions executing at that site
The transaction coordinator subsystem is not needed in the centralized environment, since,
transaction access data at only a single site. A transaction coordinator, as its name implies, is responsible
for coordinating the execution of all the transactions initiated at that site. For each such transaction, the
coordinator is responsible for

 Starting the execution of the transaction


 Breaking the transaction into a number of sub transactions and distributing these sub transactions
to the appropriate sites for execution
 Coordinating the termination of the transaction, which might result in the transaction being
committed at all sites or aborted at all sites.

4.11 Client server DBMS:


Centralized DBMS Architecture:
All DBMS functionality, application program execution, and user interface processing are carried
out on one machine. The client/server architecture was developed to deal with computing environments in
which a large number of PCs, workstations, file servers, printers, data base servers, Web servers, e-mail
servers, and other software and equipment are connected via a network. The idea is to define specialized
servers with specific functionalities. For example, it is possible to connect a number of PCs or small
workstations as clients to a file server that maintains the files of the client machines. Another machine can
be designated as a printer server by being connected to various printers; all print requests by the clients
are forwarded to this machine. Web servers or e-mail servers also fall into the specialized server category.
The resources provided by specialized servers can be accessed by many client machines.


The concept of client/server architecture assumes an underlying framework that consists of many
PCs and workstations as well as a smaller number of mainframe machines, connected via LANs and other
types of computer networks. A client in this framework is typically a user machine that provides user
interface capabilities and local processing. When a client requires access to additional functionality—such
as database access—that does not exist at that machine, it connects to a server that provides the needed
functionality. A server is a system containing both hardware and software that can provide services to the client machines,
such as file access, printing, archiving, or database access. In general, some machines install only client
software, others only server software, and still others may include both client and server software, as
illustrated in Figure 2.6. However, it is more common that client and server software usually run on
separate machines.
Two-Tier Client/Server Architectures for DBMS:
In two-tier architectures the software components are distributed over two systems: client and
server
The user interface programs and application programs can run on the client side. When DBMS access is
required, the program establishes a connection to the DBMS (which is on the server side); once the
connection is created, the client program can communicate with the DBMS. A standard called Open
Database Connectivity (ODBC) provides an application programming interface, which allows client-side
programs to call the DBMS, as long as both client and server machines have the necessary software
installed. A client program can actually connect to several RDBMSs and send query and transaction
requests using the ODBC API, which are then processed at the server sites. Any query results are sent
back to the client program, which can process and display the results as needed. A related standard for the
Java programming language, called JDBC, has also been defined. This allows Java client programs to
access one or more DBMSs through a standard interface.
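As a sketch of the client-side flow just described, the following Python fragment uses the pyodbc driver to connect through ODBC and ship a query to the server DBMS; the data source name, credentials, and table are placeholders and assume an ODBC DSN has already been configured.

import pyodbc

# Connect to the server-side DBMS through an ODBC data source.
conn = pyodbc.connect("DSN=BankDSN;UID=dbuser;PWD=secret")
cursor = conn.cursor()

# The query is sent to the server; only the result rows come back to the client.
cursor.execute(
    "SELECT account_number, balance FROM account WHERE branch_name = ?",
    ("Hillside",))
for account_number, balance in cursor.fetchall():
    print(account_number, balance)

conn.close()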


Three-Tier Architecture:
Many Web applications use an architecture called the three-tier architecture, which adds an
intermediate layer between the client and the database server, as illustrated in Figure 2.7(a)

The intermediate layer or middle layer is sometimes called the application server or Web server.
This server plays an intermediary role by running application programs and storing business rules
(procedures or constraints) that are used to access data from the database server. It can also improve
database security by checking a client’s credentials before forwarding a request to the database server.
Clients contain GUI interfaces and some additional application-specific business rules. The intermediate
server accepts requests from the client, processes the request and sends database queries and commands
to the database server, where it may be processed further and filtered to be presented to users in GUI
format.


4.12 Multidimensional Database:


A multidimensional database is a computer software system designed to allow for the efficient and
convenient storage and retrieval of large volumes of data that is (1) intimately related and (2) stored,
viewed and analyzed from different perspectives. These perspectives are called dimensions.

Getting answers to typical business questions from raw data often requires viewing that data from
various perspectives. For example, an automobile marketer wanting to improve business activity might
want to examine sales data collected throughout the organization. The evaluation would entail viewing
historical sales volume figures from multiple perspectives such as:

· Sales volumes by model


· Sales volumes by color
· Sales volumes by dealership
· Sales volumes over time
Analyzing the Sales Volumes data from any one or more of the above perspectives can yield
answers to important questions such as: What is the trend in sales volumes over a period of time for a
specific model and color across a specific group of dealerships?

Having the ability to respond to these types of inquiries in a timely fashion allows managers to
formulate effective strategies, identify trends and improve their overall ability to make important business
decisions. Certainly, relational databases could answer the question above, but query results must also
come to the manager in a meaningful and timely way. End users needing interactive access to large
volumes of data stored in a relational environment are often frustrated by poor response times and lack of
flexibility offered by relational database technology and their SQL query building tools.

Relational Database Structure:


The Table 1 below illustrates a typical relational database representation of a Sales Volumes
dataset for the Gleason automobile dealership. The data in our relational table is stored in "records."
Each record corresponds to a row of the table and each record is divided into "fields." The fields in a table
are arranged vertically in columns. In our example, the fields are SALES VOLUMES, COLOR and
MODEL. The SALES VOLUMES field contains the data we wish to analyze. The COLOR and MODEL
fields contain the perspectives we will analyze the data from. Thus, the first record in the table below tells
us that six sales were made for blue mini vans.


Table 1: SALES VOLUMES FOR GLEASON DEALERSHIP

MODEL          COLOR    SALES VOLUME
MINI VAN       BLUE     6
MINI VAN       RED      5
MINI VAN       WHITE    4
SPORTS COUPE   BLUE     3
SPORTS COUPE   RED      5
SPORTS COUPE   WHITE    5
SEDAN          BLUE     4
SEDAN          RED      3
SEDAN          WHITE    2

If we examine the dataset by columns (fields), we discover that the first field, MODEL, ranges
across only three possible values: MINI VAN, SPORTS COUPE and SEDAN. The second field, COLOR,
also ranges across three possible values: BLUE, RED and WHITE.
Cross Tab Views Or Arrays (Multidimensional Database):
There is an alternative way of representing this data. The figure below displays what is commonly
known as a "cross tab" view, or data matrix. In this representation the SALES VOLUMES figures are
located at the x-axis/y-axis intersections of a 3x3 matrix.

In an array, each axis is called a dimension (one of our data perspectives), and each element
within a dimension is called a position. In this example, the first dimension is MODEL; it has three
positions: MINI VAN, SEDAN and COUPE. The second dimension is COLOR; it also has three
positions: BLUE, WHITE and RED. The Sales Volume figures are located at the intersections of the
dimension positions. These intersections are called cells and are populated with our data or measures.
In short, our multidimensional array structure represents a higher level of organization than the
relational table. The structure itself contains much valuable "intelligence" regarding the relationships

between the data elements because our "perspectives" are imbedded directly in the structure as
dimensions as opposed to being placed into fields. For example, the structure of our relational table can
only tell us that there are three fields: COLOR, MODEL and SALES VOLUME. The relational structure tells
us nothing about the possible contents of those fields. The structure of the array, on the other hand, tells
us not only that there are two dimensions, COLOR and MODEL, but it also presents all possible values of
each dimension as positions along the dimension.
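The following Python sketch builds the cross-tab (array) view from the relational records of Table 1; a dictionary keyed by (model, color) stands in for the 3x3 array of cells, and the printed output is the cross-tab matrix described above.

records = [
    ("MINI VAN", "BLUE", 6),     ("MINI VAN", "RED", 5),     ("MINI VAN", "WHITE", 4),
    ("SPORTS COUPE", "BLUE", 3), ("SPORTS COUPE", "RED", 5), ("SPORTS COUPE", "WHITE", 5),
    ("SEDAN", "BLUE", 4),        ("SEDAN", "RED", 3),        ("SEDAN", "WHITE", 2),
]

models = ["MINI VAN", "SPORTS COUPE", "SEDAN"]   # positions of the MODEL dimension
colors = ["BLUE", "RED", "WHITE"]                # positions of the COLOR dimension

# Each (model, color) intersection is a cell holding the sales volume.
cells = {(m, c): v for m, c, v in records}

# Print the cross-tab view: one row per model, one column per color.
print("MODEL".ljust(14) + "".join(c.ljust(8) for c in colors))
for m in models:
    print(m.ljust(14) + "".join(str(cells[(m, c)]).ljust(8) for c in colors))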
Increasingly Complex Relational Tables
Thus far we have examined two dimensional datasets as viewed in relational tables and in arrays.
This can easily be extended to a third dimension. We will now add a new field called DEALERSHIP
(with three possible values) to our relational table:
Table 2: SALES VOLUMES FOR ALL DEALERSHIP

It is obvious that the addition of the third field (DEALERSHIP) with three possible values
(CLYDE, GLEASON, CARR) has made this relational table an even more awkward vehicle for
presenting data to the end user.

Multidimensional Simplification
In a multidimensional structure, however, the new DEALERSHIP field translates directly into a
third dimension with three positions. We now have a 3x3x3 array containing 27 cells. Once again, the end
user is able to view data along presorted dimensions with the data arranged in an inherently more
organized, and accessible fashion than the one offered by the relational table.


Adding Dimensions:
Our three dimensional example can be extended to four dimensions by adding a time dimension to
indicate the month of the year in which a sale was made. Visualizing a fourth dimension is more difficult
than visualizing three. Imagine twelve boxes into which our three dimension array can be placed. When
in the JANUARY box, the cells of the array contain Sales Volume figures for JANUARY. When in the
FEBRUARY box, the cells of the array contain Sales Volume figures for the month of FEBRUARY, and
so on.

Real World Benefits:


1. Ease of Data Presentation and Navigation
2. Ease of Maintenance
3. Performance
Applications of multidimensional technology:
1. Financial Analysis and Reporting
2. Budgeting
3. Promotion Tracking
4. Quality Assurance and Quality Control
5. Product Profitability
6. Survey Analysis


Difference between relational and multidimensional database:

4.13 Parallel Databases


Parallel database systems combine database management and parallel processing to increase
performance and availability. A parallel database system can be loosely defined as a DBMS implemented
on a parallel computer. Ideally, a parallel database system should provide the following advantages.
1. High-performance.
2. High-availability.
3. Extensibility.
Functional Architecture:
The functions supported by a parallel database system can be divided into three subsystems much
like in a typical DBMS.

Figure. General Architecture of a Parallel Database System


1. Session Manager: It plays the role of a transaction monitor, providing support for client
interactions with the server. In particular, it performs the connections and disconnections between
the client processes and the two other subsystems. Therefore, it initiates and closes user sessions
(which may contain multiple transactions).
2. Transaction Manager: It receives client transactions related to query compilation and execution. It
can access the database directory that holds all meta-information about data and programs. The
directory itself should be managed as a database in the server. Depending on the transaction, it
activates the various compilation phases, triggers query execution, and returns the results as well
as error codes to the client application.
3. Data Manager: It provides all the low-level functions needed to run compiled queries in parallel,
i.e., database operator execution, parallel transaction support, cache management, etc. If the
transaction manager is able to compile dataflow control, then synchronization and communication
among data manager modules is possible. Otherwise, transaction control and synchronization must
be done by a transaction manager module

Parallel DBMS Architectures:


There are three basic parallel computer architectures depending on how main memory or disk is
shared: shared-memory, shared-disk and shared-nothing.
 Shared-Memory:
In the shared-memory approach (see Figure 1), any processor has access to any memory
module or disk unit through a fast interconnect (e.g., a high-speed bus or a cross-bar switch). All
the processors are under the control of a single operating system. Examples of shared-memory
parallel database systems include XPRS, DBS3, and Volcano, as well as portings of major
commercial DBMSs on SMP.

Fig 1 Shared-Memory Architecture

Shared-memory has two strong advantages: simplicity and load balancing.


Shared-memory has three problems: high cost, limited extensibility and low availability.
 Shared-Disk:
In the shared-disk approach (see Figure 2), any processor has access to any disk unit
through the interconnect but exclusive (non-shared) access to its main memory. Each processor-
memory node is under the control of its own copy of the operating system. Then, each processor
can access database pages on the shared disk and cache them into its own memory. Since different

processors can access the same page in conflicting update modes, global cache consistency is
needed.

Fig:2 Shared-Disk Architecture

The first parallel DBMS that used shared-disk is Oracle with an efficient implementation
of a distributed lock manager for cache consistency. Other major DBMS vendors such as IBM,
Microsoft and Sybase provide shared-disk implementations.
Shared-disk has a number of advantages: lower cost, high extensibility, load balancing,
availability, and easy migration from centralized systems. The cost of the interconnect is
significantly less than with shared-memory since standard bus technology may be used.

 Shared-Nothing
In the shared-nothing approach (see Figure 3), each processor has exclusive access to its
main memory and disk unit(s). Similar to shared-disk, each processor memory disk node is under
the control of its own copy of the operating system. Then, each node can be viewed as a local site
(with its own database and software) in a distributed database system.

Fig:3 Shared- Nothing Architecture


As demonstrated by the existing products, shared-nothing has three main virtues: lower
cost, high extensibility, and high availability. Shared-nothing is much more complex to manage
than either shared-memory or shared-disk. Higher complexity is due to the necessary
implementation of distributed database functions assuming large numbers of nodes.


Parallel Data Placement:


Data placement in a parallel database system exhibits similarities with data fragmentation in distributed
databases. Data placement must be done so as to maximize system performance, which can be measured by
combining the total amount of work done by the system and the response time of individual queries. There are three
basic strategies for data partitioning (data placement): round-robin, hash, and range partitioning; a small
Python sketch of all three appears after the list below.

Figure 4: Data Partitioning Schemes


1. Round-robin partitioning is the simplest strategy, which ensures uniform data distribution. With n
partitions, the ith tuple in insertion order is assigned to partition (i mod n). This strategy enables the
sequential access to a relation to be done in parallel. However, the direct access to individual tuples,
based on a predicate, requires accessing the entire relation.
2. Hash partitioning applies a hash function to some attribute that yields the partition number. This
strategy allows exact-match queries on the selection attribute to be processed by exactly one node and all
other queries to be processed by all the nodes in parallel.
3. Range partitioning distributes tuples based on the value intervals (ranges) of some attribute. In
addition to supporting exact-match queries (as in hashing), it is well-suited for range queries. For
instance, a query with a predicate “A between A1 and A2” may be processed by the only node(s)
containing tuples whose A value is in range [A1; A2]. However, range partitioning can result in high
variation in partition size.
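A minimal Python sketch of the three partitioning strategies applied to a list of tuples, assuming n partitions and a key function that extracts the partitioning attribute; all names are illustrative.

def round_robin_partition(tuples, n):
    # The i-th tuple in insertion order is assigned to partition (i mod n).
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)
    return parts

def hash_partition(tuples, n, key):
    # A hash function applied to the partitioning attribute yields the
    # partition number.
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(key(t)) % n].append(t)
    return parts

def range_partition(tuples, boundaries, key):
    # boundaries = [b1, b2, ...] split the attribute domain into
    # len(boundaries) + 1 intervals; each tuple goes to the interval
    # containing its attribute value.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        v, i = key(t), 0
        while i < len(boundaries) and v >= boundaries[i]:
            i += 1
        parts[i].append(t)
    return parts

For example, range_partition(accounts, [1000, 5000], key=lambda t: t[2]) places balances below 1000, from 1000 to 4999, and 5000 or more into three separate partitions.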

Query Parallelism:
Parallel query execution can exploit two forms of parallelism: inter- and intra-query. Inter-query
parallelism enables the parallel execution of multiple queries generated by concurrent transactions, in
order to increase the transactional throughput. Within a query (intra-query parallelism), inter-operator and
intra-operator parallelism are used to decrease response time.

Intra-operator Parallelism
Intra-operator parallelism is based on the decomposition of one operator in a set of independent
sub-operators, called operator instances. This decomposition is done using static and/or dynamic
partitioning of relations. Each operator instance will then process one relation partition, also called a
bucket. The operator decomposition frequently benefits from the initial partitioning of the data (e.g., the
data are partitioned on the join attribute).

Inter-operator Parallelism
Two forms of inter-operator parallelism can be exploited. With pipeline parallelism, several
operators with a producer-consumer link are executed in parallel. For instance, the select operator in
Figure below will be executed in parallel with the join operator. The advantage of such execution is that
the intermediate result is not materialized, thus saving memory and disk accesses. Independent
parallelism is achieved when there is no dependency between the operators that are executed in parallel.

4.14 Spatial Databases


A spatial database stores objects that have spatial characteristics that describe them and that have
spatial relationships among them. The spatial relationships among the objects are important, and they are
often needed when querying the database. A spatial database is optimized to store and query data related
to objects in space, including points, lines and polygons. Satellite images are a prominent example of
spatial data. Queries posed on these spatial data, where predicates for selection deal with spatial
parameters, are called spatial queries.

Table next shows the common analytical operations involved in processing geographic or spatial
data.
Analysis Type                  Type of Operations and Measurements
Measurements                   Distance, perimeter, shape, adjacency, and direction
Spatial analysis/statistics    Pattern, autocorrelation, and indexes of similarity and
                               topology using spatial and non-spatial data
Flow analysis                  Connectivity and shortest path


Location analysis              Analysis of points and lines within a polygon
Terrain analysis               Slope/aspect, catchment area, drainage network
Search                         Thematic search, search by region

 Measurement operations are used to measure some global properties of single objects and to
measure the relative position of different objects in terms of distance and direction.
 Spatial analysis operations, which often use statistical techniques, are used to uncover spatial
relationships within and among mapped data layers.
 Flow analysis operations help in determining the shortest path between two points.
 Location analysis aims to find if the given set of points and lines lie within a given polygon
(location).
 Digital terrain analysis is used to build three-dimensional models, where the topography of a
geographical location can be represented with an x, y, z data model known as Digital Terrain (or
Elevation) Model (DTM/DEM).
 Spatial search allows a user to search for objects within a particular spatial region. For example,
thematic search allows us to search for objects related to a particular theme or class, such as
“Find all water bodies within 25 miles of Atlanta” where the class is water.

Spatial Data Types:


Spatial data comes in three basic forms. These forms have become a de facto standard.
 Map Data includes various geographic or spatial features of objects in a map, such as an object’s
shape and the location of the object within the map. The three basic types of features are points,
lines, and polygons (or areas). Points are used to represent spatial characteristics of objects whose
locations correspond to a single 2-d coordinate (x, y, or longitude/latitude) in the scale of a
particular application. Lines represent objects having length. Polygons are used to represent
spatial characteristics of objects that have a boundary
 Attribute data is the descriptive data that GIS systems associate with map features.
 Image data includes data such as satellite images and aerial photographs, which are typically
created by cameras. Images can also be attributes of map features. Aerial and satellite images are
typical examples of raster data.

Models of Spatial Information:


Models of spatial information are sometimes grouped into two broad categories: field and object.
A spatial application (such as remote sensing or highway traffic control) is modeled using either a field-
or an object-based model, depending on the requirements and the traditional choice of model for the
application. Field models are often used to model spatial data that is continuous in nature, such as terrain

elevation, temperature data, and soil variation characteristics, whereas object models have traditionally
been used for applications such as transportation networks, land parcels, buildings, and other objects that
possess both spatial and non-spatial attributes.

Spatial Operators:
Spatial operators are used to capture all the relevant geometric properties of objects embedded in
the physical space and the relations between them, as well as to perform spatial analysis. Operators are
classified into three broad categories.
(i)Topological operators: Topological properties are invariant when topological transformations are
applied. These properties do not change after transformations like rotation, translation, or scaling.
Topological operators are hierarchically structured in several levels, where the base level offers operators
the ability to check for detailed topological relations between regions with a broad boundary, Examples
include open (region), close (region), and inside (point, loop).
(ii) Projective operators: Projective operators, such as convex hull, are used to express predicates about
the concavity/convexity of objects as well as other spatial relations.
(iii)Metric operators: Metric operators provide a more specific description of the object’s geometry.
They are used to measure some global properties of single objects (such as the area, relative size of an
object’s parts, compactness, and symmetry), and to measure the relative position of different objects in
terms of distance and direction. Examples include length (arc) and distance (point, point).
The operations performed by the operators mentioned above are static, in the sense that the
operands are not affected by the application of the operation.
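As a rough illustration of these categories, the sketch below implements one metric operator (the distance between two points) and one topological operator (a point-in-polygon test using ray casting); the polygon and coordinates are invented values:

    import math

    def distance(p, q):
        """Metric operator: Euclidean distance between two points."""
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def inside(point, polygon):
        """Topological operator: even-odd (ray-casting) point-in-polygon test."""
        x, y = point
        hit = False
        n = len(polygon)
        for i in range(n):
            x1, y1 = polygon[i]
            x2, y2 = polygon[(i + 1) % n]
            if (y1 > y) != (y2 > y):
                if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                    hit = not hit
        return hit

    square = [(0, 0), (4, 0), (4, 4), (0, 4)]              # assumed polygon
    print(distance((0, 0), (3, 4)))                         # 5.0
    print(inside((2, 2), square), inside((5, 5), square))   # True False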

Dynamic Spatial Operators:


Dynamic operations alter the objects upon which the operations act. The three fundamental
dynamic operations are create, destroy, and update. A representative example of dynamic operations
would be updating a spatial object that can be subdivided into translate (shift position), rotate (change
orientation), scale up or down, reflect (produce a mirror image), and shear (deform).

Spatial Queries:
Spatial queries are requests for spatial data that require the use of spatial operations. Three typical
types of spatial queries are:
 Range query. Finds the objects of a particular type that are within a given spatial area or within a
particular distance from a given location.
 Nearest neighbor query. Finds an object of a particular type that is closest to a given location.
 Spatial joins or overlays. Typically joins the objects of two types based on some spatial
condition, such as the objects intersecting or overlapping spatially or being within a certain
distance of one another.
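A minimal sketch of a nearest neighbor query by brute-force scan is shown below; a real system would answer it through a spatial index rather than by examining every object, and the (name, x, y) tuples are assumptions:

    import math

    def nearest_neighbor(objs, location):
        """Brute-force nearest neighbor query; a spatial index such as an
        R-tree would avoid scanning every object."""
        return min(objs, key=lambda o: math.hypot(o[1] - location[0],
                                                  o[2] - location[1]))

    # Hypothetical (name, x, y) tuples.
    restaurants = [("A", 1.0, 1.0), ("B", 5.0, 2.0), ("C", 2.0, 0.5)]
    print(nearest_neighbor(restaurants, (0.0, 0.0)))   # ('A', 1.0, 1.0)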

Spatial Data Indexing:


A spatial index is used to organize objects into a set of buckets so that objects in a particular
spatial region can be easily located. Each bucket has a bucket region, a part of space containing all objects
stored in the bucket. The bucket regions are usually rectangles. There are two ways of providing a spatial
index.
 Specialized indexing structures that allow efficient search for data objects based on spatial
search operations are included in the database system. Examples of these indexing structures are
grid files and R-trees. Special types of spatial indexes, known as spatial join indexes, can be used
to speed up spatial join operations.
 Instead of creating brand new indexing structures, the two-dimensional (2-d) spatial data is
converted to single-dimensional (1-d) data, so that traditional indexing techniques (B+-tree) can
be used. The algorithms for converting from 2-d to 1-d are known as space filling curves.
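One widely used space filling curve is the Z-order (Morton) curve, which interleaves the bits of the x and y coordinates so that nearby 2-d points tend to receive nearby 1-d keys that can then be stored in an ordinary B+-tree. The sketch below is a minimal illustration for small non-negative integer coordinates:

    def z_order(x, y, bits=16):
        """Interleave the bits of x and y to form a 1-d Morton key."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)       # x occupies even bit positions
            key |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit positions
        return key

    # Nearby points tend to receive nearby keys, so a B+-tree over the keys
    # preserves much of the spatial locality.
    print(z_order(3, 5))   # binary 100111 -> 39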

Spatial Indexing Techniques:


 Grid File: Grid files can be used for indexing 2-dimensional and higher n-dimensional spatial data.
The fixed-grid method divides an n-dimensional hyperspace into equal size buckets. This
structure is useful for uniformly distributed data like satellite imagery. However, the fixed-grid
structure is rigid, and its directory can be sparse and large.

 R-Trees. The R-tree is a height-balanced tree, which is an extension of the B+-tree for k-
dimensions, where k > 1. For two dimensions (2-d), spatial objects are approximated in the R-tree
by their minimum bounding rectangle (MBR) which is the smallest rectangle, with sides
parallel to the coordinate system (x and y) axis, that contains the object. R-trees are characterized
by the following properties, which are similar to the properties for B+-trees but are adapted to 2-d
spatial objects. We use M to indicate the maximum number of entries that can fit in an R-tree
node.
1. The structure of each index entry (or index record) in a leaf node is (I, object-
identifier), where I is the MBR for the spatial object whose identifier is object-
identifier.
2. Every node except the root must be at least half full. Thus, a leaf node or an
intermediate node that is not the root should contain m entries, where M/2 ≤ m ≤ M.
3. All leaf nodes are at the same level, and the root node should have at least two pointers
unless it is a leaf node.
4. All MBRs have their sides parallel to the axes of the global coordinate system. (A minimal sketch of the MBR intersection test used during R-tree search appears after this list.)

 Quadtrees generally divide each space or subspace into equally sized areas and to identify the
positions of various objects.

 Spatial Join Index precomputes a spatial join operation and stores the pointers to the related
object in an index structure. Join indexes improve the performance of recurring join queries. A
join index can be described by a bipartite graph G = (V1,V2,E), where V1 contains the tuple-ids
of relation R, and V2 contains the tuple-ids of relation S. The edge set E contains an edge (vr, vs) for vr
in R and vs in S if there is a tuple corresponding to (vr, vs) in the join index.
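As noted above, here is a minimal sketch of the MBR test that R-tree search relies on: a query rectangle is compared with each entry's MBR, and only subtrees whose MBRs intersect the query need to be visited. The rectangles used are invented values:

    # An MBR is (xmin, ymin, xmax, ymax) with sides parallel to the axes.
    def intersects(a, b):
        """True if two MBRs overlap; used to decide which R-tree subtrees to visit."""
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    def contains(outer, inner):
        """True if `outer` completely encloses `inner`."""
        return (outer[0] <= inner[0] and outer[1] <= inner[1] and
                outer[2] >= inner[2] and outer[3] >= inner[3])

    query = (2, 2, 6, 6)
    entries = {"park": (0, 0, 3, 3), "lake": (7, 7, 9, 9)}
    print([name for name, mbr in entries.items() if intersects(mbr, query)])  # ['park']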

Example of an R-tree: (figure not shown)

Applications of Spatial Data:


 Astronomy
 Genomics
 Multimedia information systems
 Bioinformatics
 Geography
 Remote sensing
 Urban planning
 Natural resource management

4.15 Multimedia Databases:


Multimedia databases provide features that allow users to store and query different types of
multimedia information, which includes images, video clips, audio clips and documents. Database queries
that are needed involve locating multimedia sources that contain certain objects of interest. The queries
are referred to as content-based retrieval, because the multimedia source is being retrieved based on its
containing certain objects or activities.
Identifying the contents of multimedia sources is a difficult and time-consuming task. There are
two main approaches. The first is based on automatic analysis of the multimedia source to identify certain
mathematical characteristics of their contents. The second approach depends on manual identification of
the objects and activities of interest in each multimedia source. It can be applied to all multimedia
sources, but it requires a manual preprocessing phase where a person has to scan each multimedia source
to identify and catalog the objects and activities it contains.

Characteristics of Multimedia Data Sources:
 Image:
An image is typically stored either in raw form as a set of pixel or cell values, or in
compressed form to save space. The image shape descriptor describes the geometric shape of the raw
image, which is typically a rectangle of cells of a certain width and height. Hence, each image can be
represented by an m by n grid of cells. Each cell contains a pixel value that describes the cell content.
In black-and-white images, pixels can be one bit. In gray scale or color images, a pixel is multiple
bits. Because images may require large amounts of space, they are often stored in compressed form.
Compression standards, such as GIF, JPEG, or MPEG, use various mathematical transformations to
reduce the number of cells stored but still maintain the main image characteristics. Applicable
mathematical transforms include Discrete Fourier Transform (DFT), Discrete Cosine Transform
(DCT), and wavelet transforms.
 Video source:
It is typically represented as a sequence of frames, where each frame is a still image. However,
rather than identifying the objects and activities in every individual frame, the video is divided into
video segments, where each segment comprises a sequence of contiguous frames that includes the
same objects/activities. Each segment is identified by its starting and ending frames. The objects and
activities identified in each video segment can be used to index the segments. An indexing technique
called frame segment trees has been proposed for video indexing. The index includes both objects,
such as persons, houses, and cars, as well as activities, such as a person delivering a speech or two
people talking. Videos are also often compressed using standards such as MPEG.
 Audio sources:
It includes stored recorded messages, such as speeches, class presentations, or even
surveillance recordings of phone messages or conversations by law enforcement. Here, discrete
transforms can be used to identify the main characteristics of a certain person’s voice in order to have
similarity-based indexing and retrieval.
 Text/document source:
It is basically the full text of some article, book, or magazine. These sources are typically
indexed by identifying the keywords that appear in the text and their relative frequencies. However,
filler words or common words called stopwords are eliminated from the process. Because there can
be many keywords when attempting to index a collection of documents, techniques have been
developed to reduce the number of keywords to those that are most relevant to the collection. A
dimensionality reduction technique called singular value decomposition (SVD), which is based on
matrix transformations, can be used for this purpose. An indexing technique called telescoping vector
trees (TV-trees) can then be used to group similar documents.
Automatic Analysis of Images
Analysis of multimedia sources is critical to support any type of query or search interface. We
need to represent multimedia source data such as images in terms of features that would enable us to
define similarity. The work done so far in this area uses low-level visual features such as color, texture,
and shape, which are directly related to the perceptual aspects of image content. These features are easy to
extract and represent, and it is convenient to design similarity measures based on their statistical
properties.
 Color is one of the most widely used visual features in content-based image retrieval since it does
not depend upon image size or orientation. Retrieval based on color similarity is mainly done by
computing a color histogram for each image that identifies the proportion of pixels within an
image for the three color channels (red, green, blue—RGB). However, RGB representation is
affected by the orientation of the object with respect to illumination and camera direction.
Therefore, current image retrieval techniques compute color histograms using invariant
representations such as HSV (hue, saturation, value); a minimal histogram sketch appears after this list. HSV describes colors as points in a cylinder
whose central axis ranges from black at the bottom to white at the top with neutral colors between
them. The angle around the axis corresponds to the hue, the distance from the axis corresponds to
the saturation, and the distance along the axis corresponds to the value (brightness).
 Texture refers to the patterns in an image that present the properties of homogeneity that do not
result from the presence of a single color or intensity value. Examples of texture classes are rough
and silky. Examples of textures that can be identified include pressed calf leather, straw matting,
cotton canvas, and so on. Just as pictures are represented by arrays of pixels (picture elements),
textures are represented by arrays of texels (texture elements). These textures are then placed into
a number of sets, depending on how many textures are identified in the image. These sets not only
contain the texture definition but also indicate where in the image the texture is located. Texture
identification is primarily done by modeling it as a two dimensional, gray-level variation. The
relative brightness of pairs of pixels is computed to estimate the degree of contrast, regularity,
coarseness, and directionality.
 Shape refers to the shape of a region within an image. It is generally determined by applying
segmentation or edge detection to an image. Segmentation is a region based approach that uses an
entire region (sets of pixels), whereas edge detection is a boundary-based approach that uses only
the outer boundary characteristics of entities. Shape representation is typically required to be
invariant to translation, rotation, and scaling. Some well-known methods for shape representation
include Fourier descriptors and moment invariants.
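The histogram sketch referred to in the color item above is given here. It bins pixels coarsely in RGB space and compares two images by the L1 distance between their histograms; the pixel values are invented, and production systems often bin in HSV space instead:

    def color_histogram(pixels, bins=4):
        """Build a coarse RGB histogram: the fraction of pixels falling into
        each (r, g, b) bin."""
        counts = {}
        for r, g, b in pixels:
            key = (r * bins // 256, g * bins // 256, b * bins // 256)
            counts[key] = counts.get(key, 0) + 1
        total = len(pixels)
        return {k: v / total for k, v in counts.items()}

    def histogram_distance(h1, h2):
        """L1 distance between two histograms; smaller means more similar images."""
        keys = set(h1) | set(h2)
        return sum(abs(h1.get(k, 0) - h2.get(k, 0)) for k in keys)

    # Hypothetical 3-pixel "images".
    img_a = [(255, 0, 0), (250, 10, 5), (0, 0, 255)]
    img_b = [(240, 20, 0), (0, 10, 250), (0, 0, 200)]
    print(histogram_distance(color_histogram(img_a), color_histogram(img_b)))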

Object Recognition in Images:


Object recognition is the task of identifying real-world objects in an image or a video sequence.
The system must be able to identify the object even when the images of the object vary in viewpoints,
size, scale, or even when they are rotated or translated. An important contribution to this field was made
by Lowe, who used scale invariant features from images to perform reliable object recognition. This
approach is called scale-invariant feature transform (SIFT).
The SIFT features are invariant to image scaling and rotation, and partially invariant to change in
illumination and 3D camera viewpoint. They are well localized in both the spatial and frequency domains,
reducing the probability of disruption by occlusion, clutter, or noise. In addition, the features are highly
distinctive, which allows a single feature to be correctly matched with high probability against a large
database of features, providing a basis for object and scene recognition.
For image matching and recognition, SIFT features (also known as keypoint features) are first
extracted from a set of reference images and stored in a database. Object recognition is then performed by
comparing each feature from the new image with the features stored in the database and finding candidate
matching features based on the Euclidean distance of their feature vectors. Since the keypoint features are
highly distinctive, a single feature can be correctly matched with good probability in a large database of
features.
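A hedged sketch of this matching step: each descriptor from a new image is compared, by Euclidean distance, against descriptors stored in a database. The 4-dimensional vectors below merely stand in for 128-dimensional SIFT descriptors, and Lowe's additional nearest-to-second-nearest ratio test is omitted for brevity:

    import math

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def best_match(descriptor, database, max_dist=0.5):
        """Return the id of the closest stored descriptor, or None if nothing
        is close enough to count as a match."""
        best_id, best_d = None, float("inf")
        for obj_id, stored in database:
            d = euclidean(descriptor, stored)
            if d < best_d:
                best_id, best_d = obj_id, d
        return best_id if best_d <= max_dist else None

    # Hypothetical 4-d feature vectors standing in for 128-d SIFT descriptors.
    db = [("car", [0.1, 0.9, 0.3, 0.2]), ("house", [0.8, 0.1, 0.5, 0.7])]
    print(best_match([0.12, 0.88, 0.31, 0.19], db))   # 'car'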

4.16 Mobile Database:


A mobile database is a database that can be connected to by a mobile computing device over a
wireless mobile network. Mobile databases are:
 Physically separate from the central database server.
 Resided on mobile devices.
 Capable of communicating with a central database server or other mobile clients from remote
sites.
 Handle local queries without connectivity.

Why Mobile Database:


Mobile data-driven applications enable us to access any data from anywhere, anytime.
Examples:
 Salespersons can update sales records on the move.
 Reporters can update news database anytime.
 Doctors can retrieve patient’s medical history from anywhere.
Mobile DBMSs are needed to support these applications' data processing capabilities.

Client Server Mobile Database:


Client-server model is the traditional model of information systems.
 It is the dominant model for existing mobile databases.
 The server can become a single point of failure and performance bottleneck.
 Even storing data on a cluster of machines to back up the central database might cause
performance bottlenecks and data inconsistency.
Peer to Peer Mobile Databases:


In P2P mobile databases, the database maintenance activities are distributed among clients.
 Every process plays part of the role of the server, besides its client role.
 A client that wants to access a piece of data sends a request to other peer clients, and they
forward the request until the data is found.
 The major problem in this model is ensuring the availability of data.

Characteristics of mobile environments:


 Restricted bandwidth of wireless networks.
 Limited power supply.
 Limited resources.
 Mobility.
 Disconnections.

Current Approaches:
Currently most mobile application developers use “flat files” to store application data.
A “flat file” is a file containing records that have no structured interrelationship.
Advantages:
Smaller and easier to manage.
Disadvantages:
 Applications need to know the organization of the records within the file.
 Developers have to implement the required database functionalities.
Requirements of Mobile DBMS:
Mobile DBMSs should satisfy the following requirements :
 Small memory footprint.
 Flash-optimized storage system.
 Data synchronization.
 Security.
 Low power consumption.
 Self-management.
 Embeddable in applications
Small memory footprint:
 Memory footprint is amount of main memory that an application uses while running.
 Mobile devices have limited memory, so the mobile database application should have a small
footprint.
 The size of the mobile database affects the overall application footprint.
 Mobile DBMSs should be customizable to include only the required database functionalities.
Flash-optimized storage system:
Flash memories are the dominant storage devices for portable devices. They have features such as:
 Small size.
 Better shock resistance.
 Low power consumption.
 Fast access time.
 No mechanical seek and rotational latency.
 Mobile DBMSs need to be optimized to exploit the advantages of the new storage devices.
Data Synchronization:
 Portable devices cannot stay connected all the time.
 Users can access and manipulate data on their devices.
 They are also unable to store a large amount of data due to lack of storage capacity.
 Mobile DBMSs should have the synchronize functionality to integrate different versions of data
into a consistent version.
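As a rough illustration of this requirement, the sketch below merges a device's offline changes into a server copy using a simple last-writer-wins rule keyed on timestamps; real mobile DBMSs use considerably more elaborate conflict resolution, and the record contents are invented:

    def synchronize(server, device):
        """Merge per-record versions; the copy with the newer timestamp wins.
        Each store maps record_id -> (timestamp, value)."""
        merged = dict(server)
        for rec_id, (ts, value) in device.items():
            if rec_id not in merged or ts > merged[rec_id][0]:
                merged[rec_id] = (ts, value)
        return merged

    server = {"r1": (10, "old address"), "r2": (12, "phone A")}
    device = {"r1": (15, "new address"), "r3": (14, "note added offline")}
    print(synchronize(server, device))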
Security:

 Security is very important for data-centric mobile applications.
 It is even more important when the application works with critical data whose disclosure could result in potential loss or damage.
 Data that are transmitted over a wireless network are more prone to security issues.
Low Power Consumption:
 Portable devices have limited power supplies.
 Battery life of mobile phones is expected to increase only 20% over the next 10 years.
 Processor, display and network connectivity are the main power consumers in a mobile device.
 Mobile DBMSs need to be optimized for efficient power consumption
Self Management:
 In traditional databases, the database administrator (DBA) is responsible for databases
maintenance.
 In mobile DBMSs there can be no DBA to manage the database.
 Mobile DBMSs need to support self-management and automatically perform the DBA tasks.
 Some mobile DBMSs allow remote management that enables a DBA to manage the mobile
databases from a remote location.
Embeddable in Applications:
 An administrator does not have direct access to mobile devices.
 Mobile DBMSs should be an integral part of the application that can be delivered as a part of the
applications.
 The database must be embeddable as a DLL file in the applications.
 It must also be possible to deploy the database as a stand-alone DBMS with support for multiple transactions.
Existing Mobile Databases:
 Sybase SQL Anywhere
 Oracle Lite
 Microsoft SQL Server Compact
 SQLite
 IBM DB2 Everyplace (DB2e)

4.17 Web Databases:


A web database is a database that can be queried and/or updated through the World Wide Web
(WWW). As web technologies are evolving, the WWW turned out to be the preferred medium for many
applications, e.g., e-commerce and digital libraries. These applications use information that is stored in
huge databases and can only be retrieved by issuing direct queries to the back-end databases. Database-
driven web sites have their own interfaces and access forms that create HTML pages on-the-fly. Web
database technologies define the way that these forms can connect to and retrieve data from database
servers.
The number of database-driven web sites is growing exponentially. The web pages dynamically created by these sites are hard for traditional search engines to reach. Traditional search engines are
used to crawl and index static HTML pages. However, traditional search engines cannot send queries to
web databases. The hidden information inside the web database sources is called the “deep web” in
contrast to the “surface web” that is easily accessed by traditional search engines.

Database Connectivity:
Querying the database via direct SQL is one of the most common ways to access data from a
database. The need for a standard way to query different databases arises because there are many different database servers and they differ in their query interfaces. Open DataBase Connectivity
(ODBC) was developed by Microsoft as a standard access method to different databases. The ODBC
interface defines:
 A standard way to connect and log on to a DBMS.
 A standardized representation for data types.
 Libraries of ODBC API function calls that allow an application to connect to a DBMS,
execute SQL statements, and retrieve results.
Using ODBC, a program that reads data from a database does not need to target a specific
DBMS. All that is needed is the database vendor’s ODBC driver to link the code to the
required database.
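A minimal sketch of ODBC-style access from application code, assuming the Python pyodbc package is installed and a data source name (DSN) called SalesDB has been configured; the table, columns, and credentials are made-up examples:

    import pyodbc  # assumes the pyodbc package and an ODBC driver are installed

    # Connect through the driver named in the DSN; the application code does not
    # need to know which DBMS sits behind it.
    conn = pyodbc.connect("DSN=SalesDB;UID=report_user;PWD=secret")
    cursor = conn.cursor()
    cursor.execute("SELECT item_code, SUM(total_amount) FROM sales GROUP BY item_code")
    for item_code, total in cursor.fetchall():
        print(item_code, total)
    conn.close()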
ODBC Levels of Conformance:
To provide a way for applications and drivers to implement portions of the ODBC API specific to
their needs, the ODBC standard defines conformance levels for drivers in two areas: the ODBC API and
the ODBC SQL grammar. There are three ODBC API conformance levels: Core, Level 1 and Level 2.
Each level of conformance supports a different set of ODBC functions. Table 1 summarizes the set of
functions supported at each conformance level. The ODBC SQL grammar conformance levels are:
Minimum, Core and Extended. Each higher level provides more fully-implemented data definition and
data manipulation language support. Table 2 summarizes ODBC SQL grammar conformance levels.
After the success of ODBC, Microsoft introduced OLE DB, an open specification designed to
build on the success of ODBC by providing an open standard for accessing all kinds of data, not only data
stored in a DBMS.
Java DataBase Connectivity (JDBC):
Later on, the "Write Once, Run Anywhere" Java platform called for a new Java standard for
database connectivity. Java DataBase Connectivity (JDBC) is the industry standard for database-
independent connectivity between the Java programming language and a wide range of databases. The
JDBC API provides the same functionalities as ODBC but may be invoked from inside java programs.
JDBC-ODBC bridges are drivers that provide JDBC access via ODBC drivers, which is widely used by
developers to connect to databases in a non-Java environment.
Database-to-Web Connectivity:
Several database-to-web connectivity technologies have emerged. The various technologies differ in
who sends queries to the database. Mainly, a web database environment contains a web browser (the
client), a web server (understands HTTP) and the DBMS (understands SQL). Database-to-web
connectivity requires a middleman that can understand both HTTP and SQL and can translate between them in both directions. Database-to-web connectivity technologies are either two-tier or three-tier. Figure 1 gives a simple view of the system architecture in both cases.
Two-Tier technologies:
In a two-tier architecture, only two tiers are involved in the database-to-web integration: the client tier (web browser and web server) and the server tier (DBMS). The following are three different two-tier database-to-web technologies.
 Common Gateway Interface (CGI):
CGI is a program that runs on the server tier and is responsible for all the transformations
between HTTP and SQL. A link to the CGI program exists in the web database access form. When
the HTML form is referenced by a client, the web server extracts the query parameters from it and
forwards them to the CGI program on the server tier. The CGI program reads the parameters, formats
them in the appropriate way, and sends a query to the database. Then, the result of the query is sent
back to the client tier as HTML pages formatted by the CGI program. CGI programs can be written in
many languages, e.g., Perl and JavaScript.
 Java Applets:
Java applets are java programs that can be dynamically loaded by the browser and executed in
the client site. Applets use JDBC to connect directly to the DBMS via sockets (no web server is
needed).
 Server Side Includes (SSI):
SSI is a code that is written inside an HTML page and is processed by the web server. When
the HTML page is invoked, the web server executes the scripts, that in turn use ODBC or JDBC to
read data from the DBMS, and then the web server formats the data into HTML pages and sends it
back to the browser. Common SSI scripting languages are ASP, ASP.NET, JSP, PHP, and ColdFusion.
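The sketch below illustrates, in a hedged way, the middleman role played by such server-side code: it extracts a form parameter, runs a parameterized SQL query against the back-end database, and formats the rows as HTML. An in-memory SQLite database stands in for the DBMS, and the table and parameter names are assumptions rather than part of any specific CGI or SSI framework:

    import sqlite3

    def handle_request(params, conn):
        """Translate an HTTP form parameter into SQL and the rows back into HTML."""
        cur = conn.execute("SELECT name, phone FROM customers WHERE city = ?",
                           (params.get("city", ""),))
        rows = "".join("<tr><td>{}</td><td>{}</td></tr>".format(n, p) for n, p in cur)
        return "<html><body><table>{}</table></body></html>".format(rows)

    # Tiny in-memory database standing in for the back-end DBMS.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, phone TEXT, city TEXT)")
    conn.execute("INSERT INTO customers VALUES ('Asha', '555-0100', 'Chennai')")
    print(handle_request({"city": "Chennai"}, conn))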
Three-Tier Technologies:
In three-tier architecture a Middleware tier (the Application Server) is added between the client
tier and the server tier. The middleware is responsible to transfer data between the web server and the
DBMS. The application server handles all the application operations and connections for the clients.
Application servers offer other functions such as transaction management and load balancing. Sometimes
the application server and web server are merged into the Web Application Server. Examples of
application servers are IBM Websphere, Oracle 9i, and Sun ONE Application servers.
Figure: Two-Tier web database technologies vs. Three-tier Web database technologies
4.18 Data Warehousing:


A data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a system used
for reporting and data analysis. Integrating data from one or more disparate sources creates a central
repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used
for creating trend reports for senior management, such as annual and quarterly comparisons.
The data stored in the warehouse is uploaded from the operational systems (such as marketing and sales). The data may pass through an operational data store for additional operations before it is used in the DW for reporting.

Characteristics of Data Warehousing:


Subject Oriented
Data warehouses are designed to analyze data. For example, to learn more about the company's
sales data, the company can build a warehouse that concentrates on sales. Using this warehouse, the
company can answer questions like "Who was our best customer for this item last year?” This ability to
define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate
sources into a consistent format. They must resolve such problems as naming conflicts and
inconsistencies among units of measure. When they achieve this, they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical
because the purpose of a warehouse is to enable you to analyze what has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in
contrast to online transaction processing (OLTP) systems, where performance requirements demand
that historical data be moved to an archive. A data warehouse's focus on change over time is what is
meant by the term time variant.
Structure of Data Warehouse:
Figure 29.1 gives an overview of the conceptual structure of a data warehouse. It shows the entire
data warehousing process, which includes possible cleaning and reformatting of data before loading it
into the warehouse. This process is handled by tools known as ETL (extraction, transformation, and
loading) tools. At the back end of the process, OLAP, data mining, and DSS may generate new relevant
information such as rules; this information is shown in the figure going back into the warehouse. The
figure also shows that data sources may include files.
Data Modeling for Data Warehouses


Multidimensional models take advantage of inherent relationships in data to populate data in
multidimensional matrices called data cubes. For data that lends itself to dimensional formatting, query
performance in multidimensional matrices can be much better than in the relational data model. Three
examples of dimensions in a corporate data warehouse are the corporation’s fiscal periods, products, and
regions.
A standard spreadsheet is a two-dimensional matrix. One example would be a spreadsheet of
regional sales by product for a particular time period. Products could be shown as rows, with sales
revenues for each region comprising the columns. (Figure 29.2 shows this two-dimensional organization.)

Adding a time dimension, such as an organization’s fiscal quarters, would produce a three-
dimensional matrix, which could be represented using a data cube. Figure 29.3 shows a three-dimensional
data cube that organizes product sales data by fiscal quarters and sales regions.

By including additional dimensions, a data hypercube could be produced, although more than
three dimensions cannot be easily visualized or graphically presented. The data can be queried directly in
any combination of dimensions, bypassing complex database queries. Tools exist for viewing data
according to the user’s choice of dimensions. Changing from one-dimensional hierarchy (orientation) to
another is easily accomplished in a data cube with a technique called pivoting (also called rotation). In
this technique the data cube can be thought of as rotating to show a different orientation of the axes. For
example, you might pivot the data cube to show regional sales revenues as rows, the fiscal quarter
revenue totals as columns, and the company’s products in the third dimension (Figure 29.4).

Multidimensional models lend themselves readily to hierarchical views in what is known as roll-
up display and drill-down display. A roll-up display moves up the hierarchy, grouping into larger units
along a dimension (for example, summing weekly data by quarter or by year). Figure 29.5 shows a roll-up
display that moves from individual products to a coarser-grain of product categories. A drill-down
display shown in Figure 29.6 provides the opposite capability, furnishing a finer grained view, perhaps
disaggregating country sales by region and then regional sales by subregion and also breaking up products
by styles.
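A hedged sketch of a small three-dimensional sales cube, with a roll-up along the time dimension and a pivot that reorders the dimensions; the products, regions, and figures are invented:

    # Cells keyed by (product, region, quarter) -> sales revenue.
    cube = {
        ("P123", "East", "Q1"): 100, ("P123", "East", "Q2"): 120,
        ("P123", "West", "Q1"):  80, ("P124", "East", "Q1"):  60,
    }

    def roll_up_to_year(cube):
        """Roll-up: sum the quarterly cells into annual totals per (product, region)."""
        totals = {}
        for (product, region, _quarter), amount in cube.items():
            totals[(product, region)] = totals.get((product, region), 0) + amount
        return totals

    def pivot(cube, order=(1, 0, 2)):
        """Pivot (rotate): reorder the dimensions, e.g. region first instead of product."""
        return {tuple(key[i] for i in order): v for key, v in cube.items()}

    print(roll_up_to_year(cube))   # {('P123', 'East'): 220, ('P123', 'West'): 80, ('P124', 'East'): 60}
    print(pivot(cube))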

The multidimensional storage model involves two types of tables: dimension tables and fact
tables.
 A dimension table consists of tuples of attributes of the dimension.
 A fact table can be thought of as having tuples, one per recorded fact. This fact contains
some measured or observed variable(s) and identifies it (them) with pointers to dimension
tables. The fact table contains the data, and the dimensions identify each tuple in that data.
Figure 29.7 contains an example of a fact table that can be viewed from the perspective of
multiple dimension tables.
Two common multidimensional schemas are the star schema and the snowflake schema.
 The star schema consists of a fact table with a single table for each dimension (Figure
29.7).
 The snowflake schema is a variation on the star schema in which the dimensional tables from
a star schema are organized into a hierarchy by normalizing them (Figure 29.8).
 A fact constellation is a set of fact tables that share some dimension tables. Figure 29.9
shows a fact constellation with two fact tables, business results and business forecast.
These share the dimension table called product. Fact constellations limit the possible
queries for the warehouse.

Typical Functionality of a Data Warehouse


Data warehouses exist to facilitate complex, data-intensive, and frequent ad hoc queries.
Accordingly, data warehouses must provide far greater and more efficient query support than is
demanded of transactional databases. The data warehouse access component supports enhanced
spreadsheet functionality, efficient query processing, structured queries, ad hoc queries, data mining, and
materialized views. In particular, enhanced spreadsheet functionality includes support for state-of-the-art
spreadsheet applications (for example, MS Excel) as well as for OLAP applications programs. These
offer preprogrammed functionalities such as the following:
 Roll-up. Data is summarized with increasing generalization (for example, weekly to
quarterly to annually).
 Drill-down. Increasing levels of detail are revealed (the complement of rollup).
 Pivot. Cross tabulation (also referred to as rotation) is performed.
 Slice and dice. Projection operations are performed on the dimensions.
 Sorting. Data is sorted by ordinal value.
 Selection. Data is available by value or range.
 Derived (computed) attributes. Attributes are computed by operations on stored and
derived values.
4.19 Data Mining:
Generally, data mining (sometimes called data or knowledge discovery) is the process of
analyzing data from different perspectives and summarizing it into useful information. Data mining can
be used in conjunction with a data warehouse to help with certain types of decisions. Data mining helps in
extracting meaningful new patterns that cannot necessarily be found by merely querying or processing
data or metadata in the data warehouse.

Data Mining as a Part of the Knowledge Discovery Process:


Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses
more than data mining. The knowledge discovery process comprises six phases: data selection, data
cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of
the discovered information.
As an example, consider a transaction database maintained by a specialty consumer goods retailer.
Suppose the client data includes a customer name, ZIP Code, phone number, date of purchase, item code,
price, quantity, and total amount.
A variety of new knowledge can be discovered by KDD processing on this client database. During
data selection, data about specific items or categories of items, or from stores in a specific region or area
of the country, may be selected.
The data cleansing process then may correct invalid ZIP Codes or eliminate records with incorrect
phone prefixes.
Enrichment typically enhances the data with additional sources of information. For example,
given the client names and phone numbers, the store may purchase other data about age, income, and
credit rating and append them to each record.
Data transformation and encoding may be done to reduce the amount of data. For instance, item
codes may be grouped in terms of product categories into audio, video, supplies, electronic gadgets,
camera, accessories, and so on. ZIP Codes may be aggregated into geographic regions, incomes may be
divided into ranges, and so on.
It is only after this preprocessing that data mining techniques are used to mine different rules and
patterns. The result of mining may be to discover the following type of new information:
 Association rules—for example, whenever a customer buys video equipment, he or she also buys
another electronic gadget.
 Sequential patterns—for example, suppose a customer buys a camera, and within three months
he or she buys photographic supplies, then within six months he is likely to buy an accessory item.
This defines a sequential pattern of transactions. A customer who buys more than twice in lean
periods may be likely to buy at least once during the Christmas period.
 Classification trees—for example, customers may be classified by frequency of visits, types of
financing used, amount of purchase, or affinity for types of items; some revealing statistics may be
generated for such classes. We can see that many possibilities exist for discovering new
knowledge about buying patterns, relating factors such as age, income group, place of residence,
to what and how much the customers purchase. This information can then be utilized to plan
additional store locations based on demographics, run store promotions, combine items in
advertisements, or plan seasonal marketing strategies. As this retail store example shows, data
mining must be preceded by significant data preparation before it can yield useful information that
can directly influence business decisions. The results of data mining may be reported in a variety
of formats, such as listings, graphic outputs, summary tables, or visualizations.

Goals of Data Mining and Knowledge Discovery


Data mining is typically carried out with some end goals or applications. These goals fall into the
following classes:
 Prediction:
Data mining can show how certain attributes within the data will behave in the future.
Examples of predictive data mining include the analysis of buying transactions to predict what
consumers will buy under certain discounts, how much sales volume a store will generate in a
given period, and whether deleting a product line will yield more profits. In such applications,
business logic is used coupled with data mining. In a scientific context, certain seismic wave
patterns may predict an earthquake with high probability.
 Identification:
Data patterns can be used to identify the existence of an item, an event, or an activity. For
example, intruders trying to break a system may be identified by the programs executed, files
accessed, and CPU time per session. In biological applications, existence of a gene may be
identified by certain sequences of nucleotide symbols in the DNA sequence. The area known as
authentication is a form of identification. It ascertains whether a user is indeed a specific user or
one from an authorized class, and involves a comparison of parameters or images or signals
against a database.
 Classification:
Data mining can partition the data so that different classes or categories can be identified
based on combinations of parameters. For example, customers in a supermarket can be
categorized into discount seeking shoppers, shoppers in a rush, loyal regular shoppers, shoppers
attached to name brands, and infrequent shoppers. This classification may be used in different
analyses of customer buying transactions as a postmining activity. Sometimes classification based
on common domain knowledge is used as an input to decompose the mining problem and make it
simpler. For instance, health foods, party foods, or school lunch foods are distinct categories in the
supermarket business. It makes sense to analyze relationships within and across categories as
separate problems. Such categorization may be used to encode the data appropriately before
subjecting it to further data mining.
 Optimization:
One eventual goal of data mining may be to optimize the use of limited resources such as
time, space, money, or materials and to maximize output variables such as sales or profits under a
given set of constraints. As such, this goal of data mining resembles the objective function used in
operations research problems that deals with optimization under constraints.

Types of Knowledge Discovered during Data Mining:


The term knowledge is broadly interpreted as involving some degree of intelligence. There is a
progression from raw data to information to knowledge as we go through additional processing.
Knowledge is often classified as inductive versus deductive. Deductive knowledge deduces new
information based on applying prespecified logical rules of deduction on the given data. Data mining
addresses inductive knowledge, which discovers new rules and patterns from the supplied data.
It is common to describe the knowledge discovered during data mining as follows:
 Association rules. These rules correlate the presence of a set of items with another range of
values for another set of variables. Examples: (1) When a female retail shopper buys a handbag,
she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also
exhibit characteristic c. (A minimal support/confidence sketch for association rules appears after this list.)
 Classification hierarchies. The goal is to work from an existing set of events or transactions to
create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit
worthiness based on a history of previous credit transactions. (2) A model may be developed for
the factors that determine the desirability of a store location on a 1–10 scale. (3) Mutual funds
may be classified based on performance data using characteristics such as growth, income, and
stability.
 Sequential patterns. A sequence of actions or events is sought. Example: If a patient underwent
cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea
within a year of surgery, he or she is likely to suffer from kidney failure within the next 18
months. Detection of sequential patterns is equivalent to detecting associations among events with
certain temporal relationships.
 Patterns within time series. Similarities can be detected within positions of a time series of data,
which is a sequence of data taken at regular intervals, such as daily sales or daily closing stock
prices. Examples: (1) Stocks of a utility company, ABC Power, and a financial company, XYZ
Securities, showed the same pattern during 2009 in terms of closing stock prices. (2) Two
products show the same selling pattern in summer but a different one in winter. (3) A pattern in
solar magnetic wind may be used to predict changes in Earth’s atmospheric conditions.
 Clustering. A given population of events or items can be partitioned (segmented) into sets of
“similar” elements. Examples: (1) An entire population of treatment data on a disease may be
divided into groups based on the similarity of side effects produced. (2) The adult population in
the United States may be categorized into five groups from most likely to buy to least likely to buy
a new product. (3) The Web accesses made by a collection of users against a set of documents
(say, in a digital library) may be analyzed in terms of the keywords of documents to reveal
clusters or categories of users.
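The support/confidence sketch promised under the association rules item is given here: a rule such as "video equipment implies electronic gadget" is scored by the fraction of transactions containing both items (support) and by the fraction of transactions containing video equipment that also contain a gadget (confidence). The baskets below are invented:

    # Each transaction is the set of item categories bought together (invented data).
    transactions = [
        {"video", "gadget"}, {"video", "gadget", "camera"},
        {"video"}, {"camera", "supplies"}, {"video", "gadget"},
    ]

    def support(itemset, transactions):
        """Fraction of transactions containing every item in the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """Of the transactions containing lhs, the fraction that also contain rhs."""
        return support(lhs | rhs, transactions) / support(lhs, transactions)

    print(support({"video", "gadget"}, transactions))       # 0.6
    print(confidence({"video"}, {"gadget"}, transactions))   # 0.75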

4.20. Data Mart


A data mart is a simple form of a data warehouse that is focused on a single subject (or functional
area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single
department within an organization. Given their single-subject focus, data marts usually draw data from
only a few sources. The sources could be internal operational systems, a central data warehouse, or
external data.

How Is It Different from a Data Warehouse?


A data warehouse, unlike a data mart, deals with multiple subject areas and is typically
implemented and controlled by a central organizational unit such as the corporate Information
Technology (IT) group. Often, it is called a central or enterprise data warehouse. Typically, a data
warehouse assembles data from multiple source systems.
Data marts are typically smaller and less complex than data warehouses; hence, they are typically
easier to build and maintain. Table A-1 summarizes the basic differences between a data warehouse and a
data mart.

Table A-1 Differences Between a Data Warehouse and a Data Mart


Category            | Data Warehouse   | Data Mart
Scope               | Corporate        | Line of Business (LOB)
Subject             | Multiple         | Single subject
Data Sources        | Many             | Few
Size (typical)      | 100 GB-TB+       | < 100 GB
Implementation Time | Months to years  | Months

Dependent and Independent Data Marts


There are two basic types of data marts: dependent and independent. The categorization is based
primarily on the data source that feeds the data mart. Dependent data marts draw data from a central data
warehouse that has already been created. Independent data marts, in contrast, are standalone systems built
by drawing data directly from operational or external sources of data, or both.
The main difference between independent and dependent data marts is how to populate the data
mart; that is, how we can get data out of the sources and into the data mart. This step, called the
Extraction, Transformation, and Loading (ETL) process, involves moving data from operational systems,
filtering it, and loading it into the data mart.
With dependent data marts, this process is somewhat simplified because formatted and
summarized (clean) data has already been loaded into the central data warehouse. The ETL process for
dependent data marts is mostly a process of identifying the right subset of data relevant to the chosen data
mart subject and moving a copy of it, perhaps in a summarized form.
With independent data marts, however, we must deal with all aspects of the ETL process, much as
we do with a central data warehouse. The number of sources is likely to be fewer and the amount of data
associated with the data mart is less than the warehouse.
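A hedged sketch of the ETL step for an independent data mart: rows are extracted from an operational source, filtered to the mart's single subject, summarized, and loaded. The source rows and the choice of Sales as the subject are assumptions:

    # Extract: rows as they come from an operational system (invented data).
    source_rows = [
        {"dept": "Sales", "region": "East", "amount": 100},
        {"dept": "Sales", "region": "East", "amount": 250},
        {"dept": "HR",    "region": "East", "amount": 999},   # not our subject
    ]

    def etl_for_sales_mart(rows):
        """Filter to the Sales subject, then summarize amounts per region."""
        mart = {}
        for row in rows:                      # Transform
            if row["dept"] != "Sales":        # keep only the chosen subject area
                continue
            mart[row["region"]] = mart.get(row["region"], 0) + row["amount"]
        return mart                           # Load: in practice, insert into the mart tables

    print(etl_for_sales_mart(source_rows))    # {'East': 350}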
Steps in Implementing a Data Mart:
The major steps in implementing a data mart are to design the schema, construct the physical
storage, populate the data mart with data from source systems, access it to make informed decisions, and
manage it over time.
Designing
The design step is first in the data mart process. This step covers all of the tasks from initiating the
request for a data mart through gathering information about the requirements, and developing the logical
and physical design of the data mart. The design step involves the following tasks:
 Gathering the business and technical requirements
 Identifying data sources
 Selecting the appropriate subset of data
 Designing the logical and physical structure of the data mart
Constructing
This step includes creating the physical database and the logical structures associated with the data
mart to provide fast and efficient access to the data. This step involves the following tasks:
 Creating the physical database and storage structures, such as tablespaces, associated with the data
mart
 Creating the schema objects, such as tables and indexes defined in the design step
 Determining how best to set up the tables and the access structures
Populating
The populating step covers all of the tasks related to getting the data from the source, cleaning it up,
modifying it to the right format and level of detail, and moving it into the data mart. More formally stated,
the populating step involves the following tasks:
 Mapping data sources to target data structures
 Extracting data
 Cleansing and transforming the data
 Loading data into the data mart
 Creating and storing metadata
Accessing
The accessing step involves putting the data to use: querying the data, analyzing it, creating reports,
charts, and graphs, and publishing these. Typically, the end user uses a graphical front-end tool to submit
queries to the database and display the results of the queries. The accessing step requires that you perform
the following tasks:
 Set up an intermediate layer for the front-end tool to use. This layer, the metalayer, translates
database structures and object names into business terms, so that the end user can interact with the
data mart using terms that relate to the business function.
 Maintain and manage these business interfaces.
 Set up and manage database structures, like summarized tables, that help queries submitted
through the front-end tool execute quickly and efficiently.
Managing
This step involves managing the data mart over its lifetime. In this step, you perform management
tasks such as the following:
 Providing secure access to the data
 Managing the growth of the data
 Optimizing the system for better performance
 Ensuring the availability of data even with system failures

-------