The Effect of Database Filters On The Pe
Printed in Great Britain Pergamon Press Ltd
Abstract-To reduce the CPU load, database filters are installed on many database machines to filter
out irrelevant data from the mass storage devices. Furthermore, as CPUs are usually much faster than
I/O devices, database buffers are also installed on many database systems to avoid additional physical
I/O operations. However, after the data pages have been filtered by the database filter, the buffer hit rate
may decrease and the performance of the database system degrades accordingly. In this paper, a queueing
model is proposed to study the effectiveness of the database filter. The result shows that the system
performance can be improved slightly by the database filter. A simulation model is proposed afterwards
to compare the performance of database systems with and without the database filter under the execution
of various Selection operations. The result concludes that the database filter can only improve the system
performance by a factor of 1.24 at most, and the system performance actually degrades in most cases.
Based on the performance analysis, the use of the database filter in the buffered database systems is not
recommended.
1. INTRODUCTION
In conventional database management systems, data in the secondary storage devices must be
loaded into the main memory before the database operations can be performed. During the process
of loading, a large amount of irrelevant data may also be transferred into the main memory. The system
becomes inefficient when dealing with this redundant data. The database filter, which has been
used in many database machines [1-17], is intended to eliminate this inefficiency. An abstract
model of a system with a database filter is shown in Fig. 1. The database filter, which is installed
between the secondary storage device and the host computer, can filter out irrelevant
data from the secondary storage device on the fly. Therefore, the advantage of using the database filter is
that it saves CPU power and I/O channel capacity.
In the database filter, an output buffer is usually installed to store the temporary result data
[8-10]. The input to the database filter is all the tuples of a relation. The output tuples that satisfy
the filtering criterion are then stored in the output buffer. The filtering criterion usually consists
of the selection predicates in the Selection operation and a list of projected attributes in the
Projection operation. When the output buffer is full, the tuples stored in it are either dumped into
the secondary storage device as a new relation or directly transferred to the host computer for
further processing.
Despite the popular use of database filters in many database machines, very few papers address
their performance [8,9]. Reference [9] reports the performance of the database filter but not the overall system
performance. In [8], the database filter is implemented as an I/O device that communicates
with a PC/AT through the PC/AT internal bus. The execution time of the Selection
operation on a relation is evaluated for both dBASE III and the database filter. The relations on
the hard disk are stored as sequential files. The results show that the
database filter performs five times better than dBASE III on average.
The performance evaluation of the database filter conducted in [8] is based on the single-user
environment (i.e. only one database operation is executed in the system at one time), which is
different from the multi-user environment used in many computers. In the multi-user environment,
JANG-JONG FAN and KEH-YIH SU
Fig. 1. The abstract model for the system with a database filter.
the database filter cannot be monopolized by one Selection operation. It must be shared among
different Selection operations. Thus the effectiveness of the database filter might be different from
that in the single-user environment.
On the other hand, to reduce the number of the accesses of the physical disk pages, the database
buffer is usually installed in the database management system. As the disk I/O frequently becomes
the bottleneck in the database management system, the database buffer improves the system
performance greatly. To make the database buffer effective, it is important to keep the buffer hit
rate, defined as the probability of finding a desired disk page in the database buffer, as high as
possible. However, using the database filter to filter the data might decrease the buffer hit rate. The
reason is that the data of the disk page in the database buffer may be incomplete after filtering.
This filtered page therefore cannot be reused by other queries unless the filtering criterion is the
same. The question is thus raised: considering the possible additional I/O accesses caused by the
database filter, is it still worthwhile using the database filter? A thorough analysis is made in this
paper to answer this question.
In this paper, a queueing model is first given to study the performance effectiveness of the
database filter under the multi-user environment. The result shows that the system performance
can be improved slightly by the database filter in the best case. A simulation is then conducted
for more detailed performance measurement. The simulation uses the synthetic workload composed
of various Selection operations. The simulation shows that the database filter can improve the
system performance by a factor of 1.24 at most, and the performance actually degrades in most
cases.
This paper is organized as follows. Section 2 describes the queueing model for studying the
performance effectiveness of the database filter. A strategy for using the database filter is presented
in Section 3. The simulation and the analysis of its results are given in Section 4. Finally, Section 5
presents the conclusions.
In the database system with the database filter, only the tuples satisfying the filtering criterion
are transferred to the database buffer. Therefore, a disk page in the database buffer may
differ from its copy on the disk. A subsequent reference to the same disk page then requires an
additional physical disk access if the current filtering criterion is different from the previous one.
For example, the index nodes at higher levels (especially the root node) are frequently
reaccessed with different filtering criteria, which causes many additional I/O operations if those index
nodes are filtered each time. For this reason, the database filter may have a negative effect on
the system performance. This section presents a queueing model to study the effect of the database
filter on the system performance under the multi-user environment.
The database filter can filter out irrelevant information on the fly when the disk pages are
transferred from the disk to the database buffer. Thus, query execution in the database system
with the database filter can reduce the data transmission time and the CPU processing time because
there is less data to be transferred and processed. Since the function of the database filter is
to filter the tuples according to the filtering criterion, only the Selection operation among
the different database operations is affected. The performance comparison between the database
systems with and without the database filter is thus conducted on the execution of the Selection
operation.
Database filters and buffered relational database systems 101
Fig. 2. The closed queueing model for the execution of the Selection operation in the multi-user
environment. [Figure not reproduced; its labeled components include the disk server and the terminal servers.]
where $\rho$ is $\lambda/(\mu(1 - H))$ and $H$ is the buffer hit rate of the database system without the database
filter. When the CPU is kept busy, the completion rate of the page processing is $\mu$. The page processing
rate of the database system without the database filter, $T$, is hence formulated as

$$T = \mu U. \tag{2}$$

Similarly, the page processing rate of the database system with the database filter, $T_{df}$, is calculated
by replacing $1/\mu$, $1/\lambda$ and $H$ with $1/\mu_{df}$, $1/\lambda_{df}$ and $H_{df}$, respectively, in equations (1) and (2). Here,
$\mu_{df}$ and $\lambda_{df}$ denote the mean CPU processing and disk page access rates of the database system with
the database filter, respectively, and $H_{df}$ is the buffer hit rate of the database system with the
database filter.
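The speed-up ratio, defined as $T_{df}/T$, is used to measure the effect of the database filter on the
system performance. For the general case $\rho \neq 1$, the speed-up ratio can be written as
follows according to equations (1) and (2):

$$\frac{T_{df}}{T} = \frac{\mu_{df}\,\left[\rho_{df} - \rho_{df}^{\,n+1}\right] / \left[1 - \rho_{df}^{\,n+1}\right]}{\mu\,\left[\rho - \rho^{\,n+1}\right] / \left[1 - \rho^{\,n+1}\right]}, \qquad \rho = \frac{\lambda}{\mu(1 - H)}, \quad \rho_{df} = \frac{\lambda_{df}}{\mu_{df}(1 - H_{df})}. \tag{3}$$
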
As presented in [13], the Selection operation on a 4 kbyte page requires 1-2.5 msec of CPU processing
time and 30 msec of page access time. These CPU processing and disk page access times were measured
on an old and slow VAX 11/750 minicomputer equipped with a Fujitsu Eagle disk (the average
seek and latency times are 20 and 8 msec, respectively). The tuple length is 182 bytes. The selection
predicate includes 2 attributes, i.e. a 2-byte integer and a 52-byte string. With current
technology, the capability of a CPU can reach a few hundred MIPS (Million Instructions per
Second). However, the current disk speed (the average seek and latency times are 14 and 8 msec,
respectively) is only slightly faster than that of the Fujitsu Eagle disk. Therefore, it is reasonable to assume
that the CPU processing rate $\mu$ is much faster than the disk page access rate $\lambda$ in the model. In
addition, the buffer hit rate is rarely greater than 0.9 in actual applications. Thus, the terms
$[\lambda/(\mu(1 - H))]^{n+1}$ and $[\lambda_{df}/(\mu_{df}(1 - H_{df}))]^{n+1}$ in equation (3) can be eliminated in the multi-user
environment. For example, given $1/\lambda = 30$ msec, $1/\mu = 1$ msec, $H = 0.9$ and $n = 4$, then $\lambda/(\mu(1 - H)) = 0.33$
and $[\lambda/(\mu(1 - H))]^{n+1} = 0.004$. A similar argument can be made for the term $[\lambda_{df}/(\mu_{df}(1 - H_{df}))]^{n+1}$.
Equation (3) is then reduced to

$$\frac{T_{df}}{T} \approx \frac{\lambda_{df}(1 - H)}{\lambda(1 - H_{df})}. \tag{4}$$

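The reduction from equation (3) to equation (4) can be checked numerically. The following Python sketch assumes that the CPU utilization U takes the standard form for a closed two-server cyclic queue with n circulating queries; equation (1) is not reproduced in this text, so that form is an assumption. The names `lam`, `mu`, `utilization`, `speedup_exact` and `speedup_approx`, and the 1.017 ratio used for the filtered system, are illustrative choices rather than part of the original model.

```python
def utilization(lam, mu, hit_rate, n):
    """CPU utilization of the closed two-server model with n queries (rho != 1).

    Assumed form: U = (rho - rho**(n+1)) / (1 - rho**(n+1)).
    """
    rho = lam / (mu * (1.0 - hit_rate))
    return (rho - rho ** (n + 1)) / (1.0 - rho ** (n + 1))


def speedup_exact(lam, mu, h, lam_df, mu_df, h_df, n):
    """Equation (3): T_df / T with T = mu * U."""
    t = mu * utilization(lam, mu, h, n)
    t_df = mu_df * utilization(lam_df, mu_df, h_df, n)
    return t_df / t


def speedup_approx(lam, h, lam_df, h_df):
    """Equation (4): the rho**(n+1) terms are dropped."""
    return lam_df * (1.0 - h) / (lam * (1.0 - h_df))


# Example from the text: 1/lambda = 30 msec, 1/mu = 1 msec, H = 0.9, n = 4.
lam, mu, h, n = 1 / 30.0, 1.0, 0.9, 4
rho = lam / (mu * (1.0 - h))
print(round(rho, 2), round(rho ** (n + 1), 3))  # prints: 0.33 0.004

# Hypothetical filtered system: disk rate 1.7% higher, same CPU rate and hit rate.
lam_df = 1.017 * lam
print(speedup_exact(lam, mu, h, lam_df, mu, h, n))
print(speedup_approx(lam, h, lam_df, h))
```

With these inputs the two speed-up values differ only in the fourth decimal place, which is the sense in which the higher-order terms can be neglected.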
From equation (4), the maximum speed-up ratio is $\lambda_{df}/\lambda$ when both database systems have the
same buffer hit rate (i.e. $H_{df} = H$). In order to release the I/O channel for the use of other devices
while the disk head is moving, a local buffer is usually installed in the disk controller
to buffer the disk page. The value of $\lambda$ is then equal to $1/[(T_{seek} + T_{latency} + T_{page}) + T_{I/O}]$, where $T_{seek}$,
$T_{latency}$, $T_{page}$ and $T_{I/O}$ denote the seek time, the latency time, the page reading time and the
I/O channel transmission time, respectively. Since the database filter only reduces the I/O channel
transmission time from the local buffer of the disk controller to the system main memory, the value
of $\lambda_{df}$ is equal to $1/[(T_{seek} + T_{latency} + T_{page}) + T_{I/O} * P_f]$, where $P_f$ is the filtering factor, which is
defined as [(the portion of a disk page to be transferred)/(the total page size)]. When $P_f = 0$, the
maximum speed-up ratio is obtained with the value of

$$\frac{\lambda_{df}}{\lambda} = \frac{(T_{seek} + T_{latency} + T_{page}) + T_{I/O}}{T_{seek} + T_{latency} + T_{page}}. \tag{5}$$

If the disk access time [i.e. $(T_{seek} + T_{latency} + T_{page})$] dominates, the maximum speed-up ratio
approaches 1. If the I/O channel transmission time (i.e. $T_{I/O}$) dominates, the maximum
speed-up ratio increases as the I/O channel transmission time increases.
To get a practical range of the maximum speed-up ratio, the following parameters are substituted
into equation (5). The size of the disk page is assumed to range from 512 to 4096 bytes [14].
With current disk technology, the average seek and latency times can be 14 and 8 msec,
respectively, and the transfer rate can reach 3 Mbytes/sec (e.g. the FH-3000 x series with the SCSI
interface [15]). The current I/O channel usually has a 10 Mbytes/sec transmission rate. Based on these
assumptions, the maximum speed-up ratio is estimated as 1.002 and 1.017 when the size of the
disk page is 512 and 4096 bytes, respectively.
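The two estimates can be reproduced directly from equation (5). The sketch below is only illustrative: the function name `max_speedup` and its parameter names are hypothetical, and 1 Mbyte is assumed to mean 2**20 bytes, a convention the text does not state.

```python
MBYTE = 2 ** 20  # assumption: 1 Mbyte = 2**20 bytes


def max_speedup(page_bytes, seek_ms=14.0, latency_ms=8.0,
                disk_rate=3 * MBYTE, channel_rate=10 * MBYTE):
    """Maximum speed-up ratio lambda_df / lambda from equation (5), i.e. P_f = 0."""
    t_page = page_bytes / disk_rate * 1000.0    # page reading time in msec
    t_io = page_bytes / channel_rate * 1000.0   # I/O channel transmission time in msec
    t_disk = seek_ms + latency_ms + t_page
    return (t_disk + t_io) / t_disk


for size in (512, 4096):
    print(size, round(max_speedup(size), 3))
# prints:
# 512 1.002
# 4096 1.017
```

Because the seek and latency times (22 msec in total) dwarf the channel transmission time (under 0.4 msec for a 4096-byte page), the ratio stays very close to 1 for any page size in this range.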
Finally, we conclude this section with some remarks.
1. In the past, as the CPU speed was not much faster than the disk speed, using the database
filter to reduce the CPU load greatly improved the system performance. However, as the
current CPU speed is much faster than the disk speed and the database filter only reduces
the I/O channel transmission time, the database filter no longer plays an important role for
improving the system performance.
2. From equation (4), the buffer hit rate of the database system with the database filter must
be greater than

$$1 - \frac{\lambda_{df}}{\lambda}(1 - H)$$

to make the speed-up ratio greater than 1. The lower bound of $H_{df}$ [i.e. $1 - (\lambda_{df}/\lambda)(1 - H)$]
decreases as the page size increases (as the page size increases, $T_{I/O}$ increases), as shown in
equation (5). However, given a page size of 4096 bytes, the lower bound is
$(1.017 * H - 0.017)$. Therefore, the buffer hit rate of the database system with the database
filter must decrease by no more than $(0.017 - 0.017 * H)$ to keep the speed-up ratio greater
than 1. When $H = 0$, the maximum decrease of the buffer hit rate is only 0.017. The
observation is that a small decrease of $H_{df}$ will make the speed-up ratio less than 1.
Since the advance of the disk speed is far behind the advance of the CPU speed, the maximum
allowable decrease of the buffer hit rate will become even smaller as semiconductor technology
advances.
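The bound in the second remark follows from setting equation (4) greater than 1 and solving for the hit rate with the filter. A small Python sketch of that arithmetic is given here; the function names are hypothetical and the ratio 1.017 is the 4096-byte estimate from equation (5).

```python
def min_hit_rate_with_filter(h, ratio):
    """Smallest hit rate with the filter for which equation (4) exceeds 1.

    h is the hit rate without the filter; ratio is lambda_df / lambda.
    """
    return 1.0 - ratio * (1.0 - h)


def max_allowed_decrease(h, ratio):
    """Largest drop in hit rate that still keeps the speed-up ratio above 1."""
    return h - min_hit_rate_with_filter(h, ratio)


for h in (0.0, 0.5, 0.9):
    print(h, round(max_allowed_decrease(h, 1.017), 4))
```

The allowed decrease equals 0.017 * (1 - H), so it shrinks as the hit rate without the filter grows: a well-tuned buffer leaves the filter almost no margin.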
3. THE STRATEGY OF USING THE DATABASE FILTER WITH THE DATABASE BUFFER
As discussed in the previous section, a little decrease of the buffer hit rate will cause the speed-up
ratio to be less than 1. A strategy of using the database filter is thus proposed in this section to
make the decrease of the buffer hit rate as little as possible.
Basically, the database filter can be used in a more effective way if the following principle is
obeyed: if a disk page will be reused in the near future, the disk page should not be filtered.
Otherwise, it can be filtered.
Since the reference patterns of relational database operations are usually regular and predictable,
they can be examined in advance to determine the usage of the database filter. Depending on the
file structure where the relation is stored, two access methods, sequential search and index
search, are commonly used to perform the Selection operation. Once the access method is
determined, the use of the database filter can be set as follows.
1. If a relation is stored as a sequential file, each data page of the file is sequentially
scanned to perform the Selection operation. Since each page will not be reused after it is
released, it can be filtered without decreasing the buffer hit rate.
2. If an index tree (e.g. a B+-tree [16]) is available, the Selection operation can be performed
by a few index scans. In each index scan, the index pages (i.e. the disk pages that contain the
index entries) in the index tree are first searched down from the root level, followed
by a search of the data pages (i.e. the disk pages that contain the tuples) on the leaf level. Since
each data page is not likely to be reused, it can be filtered. However, since the index pages
are more likely to be reused (especially the root index page, which is reused for each index
scan), they should not be filtered.
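Both cases above reduce to one rule: data pages are filtered and index pages are not. A minimal Python sketch of that decision is given below; the `PageKind` enumeration and the `should_filter` function are hypothetical names introduced only for illustration.

```python
from enum import Enum


class PageKind(Enum):
    DATA = "data"    # disk page that contains tuples
    INDEX = "index"  # disk page that contains index entries


def should_filter(kind):
    """Filter only pages that are unlikely to be reused soon, i.e. data pages."""
    return kind is PageKind.DATA


# One index scan: root index page, lower-level index page, then a data page.
index_scan = [PageKind.INDEX, PageKind.INDEX, PageKind.DATA]
print([should_filter(kind) for kind in index_scan])  # prints: [False, False, True]

# A sequential scan touches data pages only, so every page is filtered.
sequential_scan = [PageKind.DATA] * 3
print(all(should_filter(kind) for kind in sequential_scan))  # prints: True
```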
4. THE SIMULATION
The queueing model given in Section 2 concludes that the maximum speed-up ratio is only
slightly greater than 1. It also shows that a small decrease of the buffer hit rate will make the
speed-up ratio less than 1. However, it does not show how much the buffer hit rate will
decrease (or how much the system performance will degrade) when the database filter is installed
in the database system. Besides, it is not easy to accurately model the behavior of the database
buffer. Although directly measuring the performance of an existing system is feasible, it is too
expensive. Therefore, the computer simulation is a suitable choice to answer those questions based
on the trade-off between the cost and the accuracy. This section first describes the simulation model
and then discusses the performance effectiveness of the database filter based on the simulation
results. Finally, the simulation result is used to verify the correctness of the queueing model
proposed in Section 2.
Fig. 3. The simulation model for the query execution in the multi-user environment. [Figure not
reproduced; its components are the terminal servers, the buffer manager, the CPU server, the disk
server and the I/O channel server.]
systems with and without the database filter is the operation of the buffer manager. The buffer
manager with the database filter first checks if the referenced page is in the database buffer. If yes,
it then checks if the current filtering criterion on the referenced page is the same as the previous
one on the same page in the database buffer.
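The two-step check performed by the buffer manager can be sketched as follows. This is an illustrative Python model rather than the simulator used in the paper: the class name `FilteredBuffer` is hypothetical, the replacement policy is a plain LRU, and an unfiltered cached page (criterion `None`) is assumed to satisfy any later reference.

```python
from collections import OrderedDict


class FilteredBuffer:
    """LRU page buffer that remembers the filtering criterion of each cached page."""

    def __init__(self, frames):
        self.frames = frames
        self.pages = OrderedDict()  # page_id -> criterion (None means unfiltered)
        self.hits = 0
        self.misses = 0

    def reference(self, page_id, criterion=None):
        """Return True on a buffer hit, False when a physical disk access is needed."""
        # Step 1: is the page cached? Step 2: was it stored unfiltered or with
        # the same filtering criterion as the current reference?
        if page_id in self.pages and self.pages[page_id] in (None, criterion):
            self.pages.move_to_end(page_id)  # mark as most recently used
            self.hits += 1
            return True
        # Miss: (re)load the page under the current criterion and evict if full.
        self.misses += 1
        self.pages[page_id] = criterion
        self.pages.move_to_end(page_id)
        if len(self.pages) > self.frames:
            self.pages.popitem(last=False)  # drop the least recently used page
        return False


buf = FilteredBuffer(frames=2)
print(buf.reference(1, "two = 0"))  # prints: False (first access)
print(buf.reference(1, "two = 0"))  # prints: True  (same criterion)
print(buf.reference(1, "two = 1"))  # prints: False (different criterion)
```

The third reference shows the effect studied in this paper: the page is present in the buffer, yet it still costs a physical disk access because it was filtered under a different criterion.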
In order to generate the sequence of events for each query, a C program was implemented on a SUN
3/160 to execute the query. During the query execution, a triplet or quadruplet, depending on
whether the database filter is used, is recorded for each page reference. This C program is designed
to count the total number of MC68020 assembly instructions executed during the processing of
the referenced page. The CPU time to process the referenced page is then determined by summing
the execution times of these assembly instructions according to the instruction timing table of the
MC68020 [17]. The I/O channel transmission time is determined by the size of the data to be
transferred divided by the I/O transmission rate.
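Both timing rules are simple ratios, as the following Python sketch shows. The function names, the cycle count in the example and the convention 1 Mbyte = 2**20 bytes are assumptions made for illustration; the 20 MHz clock and the 10 Mbytes/sec channel rate are the values used in this section.

```python
MBYTE = 2 ** 20  # assumption: 1 Mbyte = 2**20 bytes


def channel_time_ms(nbytes, rate_bytes_per_sec):
    """I/O channel transmission time: data size divided by the transmission rate."""
    return nbytes / rate_bytes_per_sec * 1000.0


def cpu_time_ms(cycles, clock_hz=20e6):
    """CPU time for a page, given the summed instruction cycle count."""
    return cycles / clock_hz * 1000.0


# A full 4096-byte page over a 10 Mbytes/sec channel.
print(channel_time_ms(4096, 10 * MBYTE))         # prints: 0.390625
# The same page after filtering with a hypothetical filtering factor of 0.25.
print(channel_time_ms(4096 * 0.25, 10 * MBYTE))  # prints: 0.09765625
# A hypothetical 20,000 cycles of page processing at 20 MHz.
print(cpu_time_ms(20000))                        # prints: 1.0
```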
Other assumptions and parameters used in the simulation model are outlined as follows. These
parameters are set to be as realistic as possible.
1. Since the global LRU replacement policy is widely used in many commercial database
management systems (e.g. INGRES and System R), the buffer management policy is
assumed to be the global LRU replacement policy [14]. Shared read and exclusive write are
permitted for each buffer frame in the buffer pool. The size of a buffer frame is equal to
that of a disk page, which is assumed to be 4 Kbytes. The size of the database buffer is set
to 1 Mbyte, which is about half the size of the accessed relation. This buffer size is adopted
to prevent the buffer from accommodating an entire relation.
2. The clock rate of CPU is assumed to be 20 MHz. The scheduling algorithm for using CPU
is assumed to be round-robin.
3. The disk scheduling algorithm is LSTF (Least Seek Time First). The timing specifications
of the disk drive are taken from the FH-3000 disk drive [15]. The seek time is determined by
the distance between the cylinder number of the referenced page and that of the current
disk head position. The latency time is assumed to be uniformly distributed between zero
and one rotation time, which is 16 msec.
4. The transmission rate of the I/O channel is assumed to be 10 Mbytes/sec.
5. The setup time of the database filter is ignored, because the setup time is usually quite small
compared to the disk access time.
6. The overheads of buffer management, disk scheduling, processor scheduling and I/O setup
are ignored because they are usually much smaller than the execution
time of a query.
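The disk assumptions in item 3 can be illustrated with a short Python sketch. It assumes that the seek time grows with the cylinder distance, so that the request with the least seek time is the one closest to the current head position; the function names are hypothetical, and the actual seek-time curve of the drive is not modeled.

```python
import random

ROTATION_MS = 16.0  # one rotation time, as stated in item 3


def pick_next_lstf(pending_cylinders, head):
    """Least Seek Time First, approximated by the smallest cylinder distance."""
    return min(pending_cylinders, key=lambda cyl: abs(cyl - head))


def sample_latency_ms(rng):
    """Rotational latency, uniformly distributed between zero and one rotation."""
    return rng.uniform(0.0, ROTATION_MS)


rng = random.Random(0)  # fixed seed so that a run is repeatable
print(pick_next_lstf([10, 95, 40], head=50))  # prints: 40
print(0.0 <= sample_latency_ms(rng) <= ROTATION_MS)  # prints: True
```

The mean of the uniform latency is 8 msec, which agrees with the average latency quoted for the disk in Section 2.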
4.2. Work load synthesis
As addressed by Boral [18], the multiprogramming level, the query mix and the degree of data
sharing are three major factors that affect the database performance in a multi-user environment.
The number of concurrent queries, i.e. the multiprogramming level, in the following simulation
varies from 1 to 8. Three levels of data sharing are defined in our simulation:
The synthetic relations used in our simulation are based on the one proposed in [19]. The
advantage of adopting these synthetic relations is that the selectivity factor, defined as the ratio
of the total number of tuples that satisfy the selection predicates to the number of tuples in the
relation, can be easily determined by selecting different attributes. Two basic relations, namely
RelationA and RelationB, are designed as follows for our simulation. The size of each relation is
set to 10,000 tuples. Each tuple is 212 bytes long and consists of a number of integer and string
attributes as shown in Table 1. The domains of the attributes are designed in such a way that various
selectivity factors can be obtained. For example, each attribute value of the attribute “unique1”
is unique, and the domain of the attribute “two” contains only 2 different attribute values. For
a detailed description of the tuple format, the reader may refer to [20].
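The relation between attribute domains and selectivity factors can be illustrated with a toy relation. In the Python sketch below, only the attribute names “unique1” and “two” and the relation size come from the text; the way the values are generated (`i` and `i % 2`) and the function name `selectivity` are assumptions made for illustration.

```python
def selectivity(tuples, predicate):
    """Ratio of the tuples that satisfy the predicate to all tuples in the relation."""
    matched = sum(1 for t in tuples if predicate(t))
    return matched / len(tuples)


# Toy relation with 10,000 tuples: "unique1" is unique, "two" takes 2 values.
relation = [{"unique1": i, "two": i % 2} for i in range(10000)]

print(selectivity(relation, lambda t: t["two"] == 0))         # prints: 0.5
print(selectivity(relation, lambda t: t["unique1"] == 1234))  # prints: 0.0001
```

An equality predicate on a unique attribute gives the lowest possible non-zero selectivity factor, which is the most favorable case for the database filter.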
To compare the performance of the database systems with and without the database filter, the
following two queries, each of which consists of one Selection operation, are designed. The first query
represents a sequential search and the second one represents an index search. To favor the system
with the database filter, the selectivity factor is set as low as possible. In the multi-user
environment, many users may simultaneously access the same relation with different selection
predicates. To simulate this situation, the search key keyval in each query is generated by a
random variable which is uniformly distributed over (0,9999). In Query II, the B+-tree [12] is
constructed on the attribute “unique1” of RelationB.
With the simulation model described above, the performance of the database systems with and
without the database filter are compared. Two strategies of using the database filter are adopted
in the simulation. The first strategy follows the principle described in Section 3. The second strategy
uses the database filter all the time regardless of whether it is an index page or a data page.
When Query I is executed, the speed-up ratio and the buffer hit rate are shown in Fig. 4 and
Table 2, respectively. Since RelationA is not indexed on any attribute, the Selection operation is
performed by sequential scan on all data pages. In this case, the same results are obtained using
either the first or the second strategy. As expected by the queueing model depicted in Section 2,
the speed-up ratio is slightly greater than 1 when the buffer hit rate does not decrease (i.e. in the
case of Level 1 data sharing). When data sharing happens (i.e. in the cases of Level 2 and Level
3 data sharing), the speed-up ratio is less than 1 because of the decrease of the buffer hit rate as
shown in Table 2. The database filter can only improve the system performance by a factor of 1.24
Table 2. The buffer hit rate in execution of Query I (search sequential file)

Number of     Level 1 data sharing    Level 2 data sharing    Level 3 data sharing
concurrent
queries       H          H_df         H          H_df         H          H_df
n = 1         [values illegible in the source]
n = 2         0.000000   0.000000     0.495288   0.000000     0.495288   0.000000
n = 4         0.000000   0.000000     0.271123   0.000000     0.695287   0.000000
n = 6         0.000000   0.000000     0.191099   0.000000     0.805711   0.000179
n = 8         0.000000   0.000000     0.174714   0.000000     0.866513   0.000000

Fig. 4. The speed-up ratio in execution of Query I (search sequential file). [Plot not reproduced;
horizontal axis: number of concurrent queries.]
in the best case (i.e. in the case of the single-user environment without data sharing). In most cases,
using the database filter degrades the system performance.
As shown in Fig. 4, when the number of concurrent queries increases, the speed-up ratio increases
in the case of Level 2 data sharing, and it decreases in the case of Level 3 data sharing. This can
be expected from the buffer hit rates as shown in Table 2. When the number of concurrent queries
increases, the difference between H and Hdf decreases in the case of Level 2 data sharing, but the
difference increases in the case of Level 3 data sharing. Since every 2 queries simultaneously access
the same relation in the case of Level 2 data sharing, increasing the number of concurrent queries
will increase the number of different relations stored in the database buffer. Thus, the probability
for each query to find a desired disk page in the database buffer would decrease. In the case of
Level 3 data sharing, all the queries simultaneously access the same relation. Because each page
will be reaccessed by many queries in the near future and the global LRU replacement policy is
adopted, the buffer hit rate would increase as the number of concurrent queries increases.
Table 3 shows the speed-up ratio measured from the simulation and that estimated by the
queueing model. The estimated speed-up ratio is calculated from equation (4) in Section 2 with
the values of $H$ and $H_{df}$ obtained from Table 2. The estimation error of the speed-up ratio from
the queueing model is less than 10% in the multi-user environment. This again verifies the results
predicted by the queueing model.
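The estimated values can be recomputed from equation (4) and the hit rates in Table 2. The Python sketch below does this for n = 4, 6 and 8 under Level 2 and Level 3 data sharing, taking 1.017 as the ratio of the disk page access rates for 4096-byte pages. The dictionary layout and the names `table2`, `table3_estimated` and `estimated_speedup` are illustrative; the numbers themselves are copied from Tables 2 and 3.

```python
RATIO = 1.017  # lambda_df / lambda for 4096-byte pages, from Section 2

# (H, H_df) pairs copied from Table 2.
table2 = {
    4: {"level2": (0.271123, 0.0), "level3": (0.695287, 0.0)},
    6: {"level2": (0.191099, 0.0), "level3": (0.805711, 0.000179)},
    8: {"level2": (0.174714, 0.0), "level3": (0.866513, 0.0)},
}

# Estimated speed-up ratios copied from Table 3 for the same rows.
table3_estimated = {
    4: {"level2": 0.741, "level3": 0.309},
    6: {"level2": 0.823, "level3": 0.197},
    8: {"level2": 0.839, "level3": 0.135},
}


def estimated_speedup(h, h_df, ratio=RATIO):
    """Equation (4): speed-up ratio predicted by the queueing model."""
    return ratio * (1.0 - h) / (1.0 - h_df)


for n, levels in table2.items():
    for level, (h, h_df) in levels.items():
        recomputed = estimated_speedup(h, h_df)
        print(n, level, round(recomputed, 4), table3_estimated[n][level])
```

For these rows the recomputed values agree with the tabulated estimates to within about 0.001.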
When Query II is executed, the speed-up ratio and the buffer hit rate are shown in Fig. 5 and
Table 4, respectively. Results similar to those of the execution of Query I are found (i.e. when
Table 3. The comparison of speed-up ratios measured from the simulation and estimated by the queueing model in execution of
Query I

Number of     0% data sharing        50% data sharing       100% data sharing
concurrent
queries       Measured   Estimated   Measured   Estimated   Measured   Estimated
n = 1         1.236      1.017
n = 2         1.027      1.017       0.449      0.493       0.449      0.493
n = 4         1.001      1.017       0.716      0.741       0.341      0.309
n = 6         1.002      1.017       0.811      0.823       0.203      0.197
n = 8         0.994      1.017       0.831      0.839       0.148      0.135

(Measured: measured from the simulation. Estimated: estimated by the queueing model.)

Fig. 5. The speed-up ratio in execution of Query II (search the index file). [Plot not reproduced;
horizontal axis: number of concurrent queries.]
the buffer hit rate decreases by no more than 0.017, the speed-up ratio is approximately 1; otherwise,
the speed-up ratio is less than 1). In addition, each entry in the column labeled “$H_{df}$” in Table 4
contains 2 values. The upper value indicates the buffer hit rate using the first strategy, and the lower
one indicates the buffer hit rate using the second strategy. Since the locality property exists in
accessing the index pages, the second strategy, which filters the index pages, has a lower buffer hit
rate, and its performance is worse than that of the first strategy.
Based on the simulation results presented above, the same conclusion as predicted by the
queueing model is obtained. In addition, the simulation results show that the buffer hit rate actually
decreases greatly and the speed-up ratio degrades considerably in the case of data sharing. Although the
first strategy prevents the decrease of the buffer hit rate when the index pages are accessed, it does
not prevent the decrease of the buffer hit rate when the data pages are accessed in
the case of data sharing. Therefore, a reaccess to a filtered page stored in the database buffer
usually causes an additional physical disk access when the filtering criterion is different. Although
the database filter can improve the system performance by a factor of 1.24 in the single-user
environment without data sharing, it degrades the system performance in the case of data
sharing. The use of the database filter is therefore not recommended in database systems with a database
buffer.
5. CONCLUSIONS
This paper presents a queueing model and a simulation model for studying the effect of the
database filter on the buffered relational database system. Since the CPU speed is much faster than
the disk speed, both the queueing and simulation models show that using the database filter does
not improve the system performance in the multi-user environment. The simulation result also
shows that using the database filter can improve the system performance by a factor of 1.24 in the
best case, which occurs in a single-user environment without data sharing. When data sharing
occurs during query execution, the buffer hit rate may decrease since filtered pages need to
be reaccessed in the near future. Therefore, the database filter actually degrades the system
performance, as shown by both the queueing model and the simulation model. It is concluded that
the database filter should not be used in a database system that has a database buffer.
REFERENCES