QoS Management of Real-Time Data Stream Queries in Distributed Environments
Abstract

Many emerging applications operate on continuous unbounded data streams and need real-time data services. Providing deadline guarantees for queries over dynamic data streams is a challenging problem due to bursty stream rates and time-varying contents. This paper presents a prediction-based QoS management scheme for real-time data stream query processing in distributed environments. The scheme features query workload estimators, which predict the query workload using execution time profiling and input data sampling. In this paper, we apply the prediction-based technique to select the proper propagation schemes for data streams and intermediate query results in distributed environments. The performance study demonstrates that the proposed solution tolerates dramatic workload fluctuations and saves significant amounts of CPU time and network bandwidth with little overhead.

1. Introduction

Many applications need to operate on continuous unbounded data streams. The streaming data may come from sensor readings, Internet router traffic traces, telephone call records, or financial tickers. Many of these new applications have inherent timing constraints on their tasks. However, due to the dynamic nature of data streams, queries on data streams may have unpredictable execution costs. First, the arrival rate of the data streams can be unpredictable, which leads to variable input volumes to the queries. Second, the content of the data streams may vary with time, which causes the selectivity (Sel) of the query operators to change over time. The selectivity of an operator is defined as:

Sel = size(output) / size(input)

It measures the fraction of the data input that passes the current operator to the next. As operator selectivity varies, the size of the intermediate results and final query results changes, even when the input volume remains static. In distributed environments, the changing intermediate result size also affects the communication cost (CPU time and network bandwidth) associated with intermediate result propagation.

To address these issues, we proposed a prediction-based QoS management algorithm [19], which uses online profiling and sampling to estimate the cost of queries on dynamic streams. The profiling process calculates the average cost (CPU time) for each operator to process one data tuple. The sampling process estimates the selectivity of the operators in a query. In this paper, we use this prediction-based QoS management in a distributed environment to select the proper propagation schemes for data streams and intermediate query results. To the best of our knowledge, this is the first work that predicts query workload and uses workload predictions to control query QoS in a distributed environment.

The rest of the paper is organized as follows: Section 2 gives an overview of the system model and our assumptions. Section 3 describes the prediction-based QoS management scheme. Section 4 describes how this scheme can be used in a distributed environment. Section 5 presents the performance evaluation and experimental results. Section 6 discusses related work, and Section 7 concludes the paper.

2. System Model and Assumptions

A data stream is defined as a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamps) sequence of data items [8]. A Data Stream Management System (DSMS) is a system especially constructed to process persistent queries on dynamic data streams. DSMSs differ from traditional database management systems (DBMSs) in that DBMSs expect the data to be persistent in the system and the queries to be dynamic, whereas DSMSs expect dynamic unbounded data streams and persistent queries. Due to the high volume [...] market analysis, have brought research related to data streams into focus. These applications inherently generate data streams, and DSMSs are well suited for such applications.
[Table 1. Operator Cost and Dependency on Selectivity and Synopsis Size (table body not recovered in this copy).]

[...] the traffic simulator, as the system still gets some information about the traffic.

3. Prediction-based QoS Management

Three parameters are needed for each query to perform prediction-based QoS management, namely, the input data stream volume, the operator selectivity, and the execution time per data tuple for each operator. Since our approach only considers queries that are ready to execute, the input data volumes are known.

3.1. Query Execution Time Estimation

Table 1 shows the average execution time per tuple of the operators and their dependency on the current operator selectivity and synopsis (e.g., indices) size. The prediction-based QoS management scheme assumes that the incoming data streams, the intermediate results, and the accessory data structures are all stored in memory. Hence the time taken for one operator to process a data tuple can be estimated effectively without considering any additional overhead of accessing data from the disk. Here, we briefly discuss the analysis for the selection and join operators. The reader is referred to [19] for a detailed discussion.

3.1.1 Selection Operation Cost Analysis

The following notations are used for a selection operator Osel:

• the input tuple volume, n
• the selectivity of the operator, s
• the execution time to evaluate the predicates, Cp
• the execution time to insert an output tuple into the buffer, Ci

The costs Cp and Ci are expected to be fairly constant for a particular set of predicates.

3.1.2 Join Operation Cost Analysis

For join operations, Symmetric Hash Join (SHJ) [21, 10] is used. SHJ works by keeping a hash table for each input in memory. When a tuple arrives, it is inserted into the hash table for its input and is used to probe the hash table for the other input. This probing may generate join results, which are then inserted into the output buffer. The following notations are used for a join operator Ojoin:

• the left and right input volumes, nL and nR
• the selectivity of the operator, s
• the execution times to probe the left and right hash indices, CLProbe and CRProbe
• the execution times to hash the left and right input, CLHash and CRHash
• the execution time to insert an output tuple into the buffer, Ci

For the join operator Ojoin:

The number of output tuples = nL × nR × s
The time for processing left input tuples = nL × CRProbe + nL × CLHash
The time for processing right input tuples = nR × CLProbe + nR × CRHash
The time for inserting output tuples = nL × nR × s × Ci
The total time = nL × (CRProbe + CLHash) + nR × (CLProbe + CRHash) + nL × nR × s × Ci

Of the three types of cost factors, the hashing costs (CLHash and CRHash) and the insertion cost (Ci) are much smaller than the probing costs (CLProbe and CRProbe) and generally remain constant for a particular system. The probing cost, however, depends on the contention rate of the hash index, which in turn depends on the input data volume and the allocated index size.
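To make the two cost models above concrete, here is a minimal Python sketch of the per-batch time estimates. The function and variable names are ours, and the selection-operator formula is our reconstruction from the notation in Section 3.1.1 (n × Cp to evaluate the predicates plus n × s × Ci to buffer the surviving tuples), not a quotation from the paper:

```python
# Sketch of the per-tuple cost models of Sections 3.1.1 and 3.1.2.
# The per-tuple cost constants (c_p, c_i, probe and hash costs) are the
# values maintained by the profiling step described in Section 3.1.3.

def selection_cost(n, s, c_p, c_i):
    """Estimated time for a selection operator on n input tuples:
    evaluate the predicates on every tuple, then insert the expected
    n * s qualifying tuples into the output buffer."""
    return n * c_p + n * s * c_i

def join_cost(n_l, n_r, s, c_lprobe, c_rprobe, c_lhash, c_rhash, c_i):
    """Estimated time for a symmetric hash join on one input batch."""
    left = n_l * (c_rprobe + c_lhash)   # left tuples probe the right index and build the left index
    right = n_r * (c_lprobe + c_rhash)  # right tuples probe the left index and build the right index
    output = n_l * n_r * s * c_i        # expected n_l * n_r * s output tuples are buffered
    return left + right + output
```

For example, with nL = 10, nR = 20, s = 0.1, probe costs of 1.0, hash costs of 0.1, and Ci = 0.05, the join estimate is 11 + 22 + 1 = 34 time units.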
3.1.3 Cost Constant Calculation Using Profiling

The exponential smoothing algorithm is chosen to give relatively higher weights to recent observations in forecasting than to older observations. Suppose that after a query instance is executed, the value for the cost parameter C is computed to be Cnew; then C is updated using the single exponential smoothing formula:

C = C × (1 − α) + Cnew × α,  0 < α < 1

3.2. Selectivity Estimation Using Sampling

In the prediction-based approach, a sampler query plan is constructed for every query in the system. The query plans for the sampler queries are exactly the same as their corresponding real queries' plans. When a query instance is released to the scheduler, the sampler is executed first with sampled data tuples from the input. The data tuples are sampled from the real input according to a preset sample ratio Sr. The sampling process selects a simple random sample without replacement. The execution of this sampler query plan is used to estimate the selectivity, and hence the execution time, of the operators in the query plan.

3.3. Inter-Query QoS Management

Inter-query refinement is performed to ensure that all query instances get a fair chance to meet their deadlines. A pseudo-deadline is assigned to the queries that are ready to execute. This pseudo-deadline is based on the estimated execution time and the input/quality table for each query. The input/quality table is a user-defined table which maps the fraction of input tuples used to the quality of the query results.

[Figure 2. Inter-Query QoS Management with Query Quality Curves (graphic not recovered). The input/quality tables shown in the figure are reproduced below:]

    Input %   Q0 Quality   Q1 Quality   Q2 Quality
    0.9       0.98         0.9          0.45
    0.8       0.96         0.8          0.30
    0.7       0.95         0.7          0.20
    0.6       0.92         0.6          0.14
    0.5       0.87         0.5          0.09
    0.4       0.82         0.4          0.05
    0.3       0.75         0.3          0.03
    0.2       0.65         0.2          0.02
    0.1       0.50         0.1          0.01

As illustrated in Figure 2, the three queries Q0, Q1, and Q2 have different requirements in terms of maintaining query quality when the system is overloaded. As reflected by the input/quality tables and the curves, the query quality of Q0 (an example of a convex QoS curve) drops slowly with the input dropping ratio. A query with a convex QoS curve can still calculate the average value reasonably well when a small percentage of input data tuples is dropped. The quality of Q1 (an example of a linear QoS curve) drops linearly with the input data dropping ratio. On the other hand, Q2 (an example of a concave QoS curve) is the opposite of Q0, as it cannot tolerate dropping any input data tuples. As an example, if the QoS of the target system is set to 70%, it translates into dropping 75%, 30%, and 3% of the input data for Q0, Q1, and Q2, respectively. These ratios are called drop ratios and denote the fraction of input tuples dropped.

The algorithm for inter-query QoS negotiation is an iterative process which keeps lowering the query QoS from 100% and calculates the cost of all active queries [19]. The query QoS for which all the queries finish their execution before their deadlines is chosen, and the corresponding pseudo-deadlines are calculated for the queries. The pseudo-deadlines are assigned proportionally to the estimated times of the queries.

3.4. Intra-Query QoS Refinement

In inter-query QoS management, every query instance is assigned a pseudo-deadline based on the estimated execution time for the query. The query instances then perform intra-query refinement to fulfill their pseudo-deadlines instead of their actual deadlines. Before a query starts executing, it drops a fraction of the input data if the estimated execution time of the query exceeds the pseudo-deadline. The data tuples are dropped early in the query plan, as doing so yields the best system utility [1]. The scheme used for determining the drop amounts is discussed in [19]. Furthermore, the progress of the query is monitored periodically to ensure that the query meets its pseudo-deadline. If the query is running late, data tuples are dropped during execution to ensure that the query meets its deadline. Finally, for each query, the total number of tuples dropped during its execution is calculated. Then, the input/quality table for the query is used to find the quality of the query result. This query quality is used as a measure of system performance. The ultimate objective is to maximize the average quality of the query results.

4. QoS Management in a Distributed Environment

The prediction-based approach performs very well in maintaining query QoS in a centralized environment [19]. In this section, we discuss our dynamic data stream propagation algorithm in distributed DSMSs. In distributed DSMSs, one of the key problems is to reduce the data stream propagation cost. Data stream propagation not only consumes network resources, but also consumes a fair amount of CPU time.
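To make this propagation cost concrete, the following sketch estimates the combined CPU time a sender/receiver pair spends shipping one stream segment. The packet overheads and packet size follow the simulation settings of Table 2; the helper itself is our illustration, not code from the paper:

```python
import math

SEND_US = 5             # microseconds to send one packet (Table 2)
RECV_US = 10            # microseconds to receive one packet (Table 2)
TUPLES_PER_PACKET = 10  # data tuples carried per packet (Table 2)

def propagation_cpu_us(n_tuples):
    """CPU microseconds spent by the sender and the receiver together
    to ship n_tuples unprocessed tuples as whole packets."""
    packets = math.ceil(n_tuples / TUPLES_PER_PACKET)
    return packets * (SEND_US + RECV_US)
```

At the simulated rate of 1000 tuples per second, shipping one one-second segment costs propagation_cpu_us(1000) = 1500 microseconds of combined send/receive CPU time, which is the kind of cost the propagation strategies discussed in this section try to reduce.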
[Figures 3 and 4 (graphics not recovered): the query plan at node B (Op0: projection; Op1: selection or join; Op2: join over the local stream S2 and relation R), with stream S1 transmitted unprocessed from node A and replicated at node B (Figure 3), and with the filter operator Op0 moved to node A (Figure 4).]

From a previous study [11], network operations cost as much as 35% of CPU time. However, this cost can be reduced if the remote node has the ability to choose whether to transmit unprocessed data streams, or to process these data streams and transmit the intermediate results (which are generally smaller in size).

Suppose node B contains a data stream query that operates on a local stream S2, a local relation R, and a remote stream S1 from node A. As shown in Figure 3, new data tuples in S1 are sent to node B when they arrive at node A, and the data stream S1 is replicated at node B. One simple optimization is to move the obviously selective operators, such as Op0, to the remote node A and execute these operators at the remote node. As shown in Figure 4, the operator Op0 can be moved to node A to reduce the network workloads. We call these obviously selective operators filters, as [...]

[...] operator at the remote node for operators like the join operator Op1. The sampler operator samples a small number of data tuples from the incoming data streams and processes the join operation. If the output size is significantly smaller than the input size, the data stream source node should process the join operator locally and transmit the intermediate results to the remote node.

[Figure 6 (graphic not recovered): the rejection region for the sampled selectivity; a sampled selectivity above 0.64 implies that the operator selectivity is greater than 0.5 with 97.5% confidence (axis marks at 0.36, 0.5, and 0.64).]
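The source-node decision just described can be sketched as follows. This is an illustrative reconstruction rather than the paper's code: the names and defaults are ours; the break-even selectivity of 0.5 corresponds to the assumption that output tuples are twice the size of input tuples, and the two-sigma margin mirrors the hypothesis-testing example worked out later in this section:

```python
import math

def choose_propagation(sampled_sel, n_samples, output_to_input_ratio=2.0, z=2.0):
    """Decide whether the source node should ship raw tuples, or run the
    join locally and ship the intermediate results.

    Intermediate results are smaller than the raw input only when the
    join selectivity is below 1 / output_to_input_ratio (0.5 when output
    tuples are twice the input tuple size). We ship intermediate results
    only if the sampled selectivity is below that break-even point by z
    standard deviations of the sample mean (normal approximation to the
    binomial, worst case p = 0.5).
    """
    break_even = 1.0 / output_to_input_ratio
    sigma = 0.5 / math.sqrt(n_samples)  # std. dev. of the sampled selectivity mean
    if sampled_sel < break_even - z * sigma:
        return "send_intermediate_results"
    return "send_unprocessed_stream"
```

With 50 samples, sigma is about 0.0707, so intermediate results are shipped only when the sampled selectivity falls below roughly 0.36, matching the 0.36 / 0.5 / 0.64 boundaries of Figure 6.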
[...] selectivity values from the random sampling follow the binomial distribution. A binomial distribution with sample size n and success probability p approximates a normal distribution for large n and for p not too close to 1 or 0 (statistics texts [14] recommend using this approximation only if np and n(1 − p) are both at least 5; otherwise, a continuity correction should be applied). Since we sample 50 times out of the pool, the distribution of the sample selectivity mean approximates the normal distribution with mean 0.5 and standard deviation δ = 0.5/√50 = 0.0707. As shown in Figure 6, according to the normal distribution property, if the sampled selectivity is higher than 0.5 + 0.0707 × 2 = 0.64, the current operator selectivity is higher than 0.5 with 97.5% probability. In terms of hypothesis testing, the hypothesis that the selectivity value is less than or equal to 0.5 is called the null hypothesis, and the region where the selectivity value is higher than 0.64 is called the rejection region. This is because, if the sampled selectivity falls in that region, the null hypothesis is very unlikely and so it can be rejected. As the confidence value increases, the rejection region shrinks, and vice versa.

To be able to switch between transmitting unprocessed data streams and transmitting processed intermediate results, the receiving node (in our example, node B) has to be able to accept both types of data and use them to process the stream query together. This is implemented by dividing the data stream into segments by either the timestamps or the sequence numbers of the data tuples. Each segment is sampled separately, and the proper transmission strategy is determined accordingly. As shown in Figure 7, at the receiving node, unprocessed data stream segments are processed first, and then the results are assembled together with the received intermediate results. The data transmitted needs to be marked up properly so that it can be assembled correctly.

[Figure 7. Assembly Data Stream Segments (graphic not recovered): at node B, unprocessed stream segments from node A pass through Op0 and Op1, and the results are assembled with the intermediate results received from node A before the join Op2 with stream S2 and relation R.]

5. Performance Evaluations

We conducted a set of experiments to test the performance of the prediction-based algorithm in reducing the network workloads. In order to evaluate the performance of our approach, we developed a Java simulator based on hypothesis testing. The simulation settings are shown in Table 2. The overheads for sending and receiving a packet are set to 5 and 10 microseconds, respectively. According to the Linux kernel research report [7], in Linux kernel 2.6.9 it takes 6 microseconds to send a packet and 17 microseconds to receive a packet using TCP/IP. The overheads are reduced in the simulation settings to reflect technology advances. Each packet is set to contain 10 data tuples. Based on the experimental results shown in Table 1, the sampler operator overhead is set at 3 microseconds per tuple for the join operator. The data stream arrival rate is set at 1000 tuples per second, and each data stream sampling segment contains the tuples within one second. The significance level for the hypothesis test used in the algorithm is set at 15%. A significance level of 15% means that the algorithm chooses to transmit intermediate results instead of unprocessed data streams only when the sampled results indicate that the intermediate results are smaller than the input with 85% or higher confidence. To evaluate the effectiveness of the proposed algorithm, we also show the performance of the ideal algorithm. The ideal algorithm is marked as the Oracle algorithm, which always uses the best propagation strategy but incurs no overhead. In this section, the results shown in the graphs are based on at least 10 simulation runs, and the 95% confidence interval is within 5% of the value shown in the graph. The confidence intervals have been omitted in the figures to improve readability.

Table 2. Distributed DSMS Simulation Settings

    Parameter                   Value
    Packet Sending Overhead     5 microseconds
    Packet Receiving Overhead   10 microseconds
    Tuples per Packet           10
    Sampler Operator Overhead   3 microseconds per tuple
    Data Stream Rate            1000 tuples/sec
    Segment Size                1 second
    Significance Level          15%

[Figure 8 (graphic not recovered): network traffic ratio (prediction-based algorithm / unprocessed data stream replication) versus join selectivity, for sampling ratios of 1% and 5% and for the Oracle algorithm.]

The network traffic results are shown in Figure 8. The ratio shown in the figure denotes the ratio between the network traffic volume generated by the prediction-based algorithm and that generated by unprocessed data stream replication. We assume that the output tuples of the join operator are twice the size of the data stream input tuples. As a result, the network traffic volume can only be reduced when the join selectivity is less than 0.5. As shown in Figure 8, with sampling ratios as low as 1% and 5%, the prediction-based algorithm saves significantly when the join selectivity is less than 0.5. In fact, in both cases, the algorithm performs very close to the ideal case shown by the Oracle algorithm. As expected, the algorithm performs better in terms of network traffic volume when the sampling ratio is higher, at 5%. As the selectivity moves higher than 0.5, the algorithm generates slightly more network traffic. With the sampling ratio at 1%, the algorithm produces 5% more network traffic when the selectivity is 0.6. This is caused by occasional mispredictions resulting from the low sampling ratio. The problem disappears when the sampling ratio is set at 5%.

[Figure 9. CPU Overhead Reduction (graphic not recovered).]

The total CPU time used by the algorithm is shown in Figure 9. The total CPU time shown in the figure contains both the network packet sending/receiving overhead and the sampler operator execution overhead. The ratio shown is the CPU time used by the prediction-based algorithm to the CPU time used to transmit unprocessed data streams. As we can see in Figure 9, when the join operator selectivity is less than 0.5, the prediction-based algorithm saves a significant amount of CPU time. When the selectivity is higher than 0.5, the prediction-based algorithm costs more CPU time due to the sampling overhead and rare mispredictions. When sampling at 1%, the prediction-based algorithm works pretty well [...]

6. Related Work

[...] while still guaranteeing adequate answer precision. The D-CAPE paper [11] discusses the challenges for distributed data stream management systems and proposes dynamic adaptation techniques to alleviate uneven workloads in distributed environments. In this paper, we propose the idea of just-in-time sampling to estimate the output size of query operators and use the estimation results to control the intermediate query result propagation strategy. Compared to the algorithm for filter operators discussed in [15], our technique handles join operators, which may have a bigger output volume than their input volume.

Selectivity estimation has been one of the most studied problems in the database community, as query optimizers use the estimation results to determine the most efficient query plans. Sampling [9], histograms [16, 17], index trees [3, 6], and the discrete wavelet transform [18] are the most widely used selectivity estimation methods. Sampling has been used extensively in traditional database systems [9, 2, 4]. Sampling gives a more accurate estimation than the parametric and curve-fitting methods used in traditional DBMSs and provides a good estimation for a wide range of data types [2]. Furthermore, since no data structure is maintained in sampling-based approaches, as opposed to histogram-based approaches, we do not need to worry about the overhead of constantly updating and maintaining such a structure. This is a very important point in the context of data streams, as the input rate of the streams is constantly changing. Prediction-based QoS management [19] is the first work to consider a sampling-based approach to estimate data stream query workload and use those results to manage the query QoS. To the best of our knowledge, this is the first work which uses this approach for prediction-based QoS management in distributed environments.
7. Conclusions and Future Work

In this paper, we describe the prediction-based QoS management algorithm for distributed environments, where we apply dynamic sampling to determine the proper data stream propagation strategy. Our simulation results show that adjusting the data transmission strategy using prediction results significantly reduces the communication cost with a minimal amount of CPU overhead. There are a couple of ways to extend this work. First, more research is needed to provide solutions for the scenario where the selectivity of join operators is very small. It is a known problem that sampling yields a high relative error rate when dealing with query operators with small selectivities [4]. The high relative estimation errors cause the sampling algorithm to select less optimal propagation strategies, thus compromising the performance of the proposed algorithm. Another way to extend the current work is to combine the operator history with the estimation. The selectivity history of the query operators provides valuable clues about future operator selectivity. One idea is to utilize the volatility of operator selectivity, i.e., to use sampling only on those operators with volatile selectivities. In this way, the sampling overhead on the less volatile operators can be saved. Lastly, evaluating this approach on a prototype system would be very valuable for the performance evaluation of the algorithm. As the performance study in this paper is via simulations, it is desirable to develop a prototype for a distributed DSMS and carry out detailed experiments to study the overhead associated with the prediction-based algorithm proposed in this paper.

Acknowledgments

The work was supported in part by NSF IIS-0208758, CCR-0329609, and CNS-0614886.

References

[1] B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation queries over data streams. In Intl. Conference on Data Engineering (ICDE), 2004.

[2] D. Barbara, W. DuMouchel, C. Faloutsos, P. Haas, J. Hellerstein, Y. Ioannidis, H. Jagadish, T. Johnson, R. Ng, V. Poosala, K. Ross, and K. Sevcik. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 1997.

[3] R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 1972.

[4] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. Narasayya. Overcoming limitations of sampling for aggregation queries. In ICDE, pages 534–542, 2001.

[5] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. Zdonik. Scalable distributed stream processing. In the First Biennial Conference on Innovative Database Systems (CIDR), 2003.

[6] D. Comer. The ubiquitous B-tree. Computing Surveys, 1979.

[7] J. Demter, C. Dickmann, H. Peters, N. Steinleitner, and X. Fu. Performance analysis of the TCP/IP stack of Linux kernel 2.6.9. Technical report, University of Goettingen, 2005.

[8] L. Golab and M. Ozsu. Issues in data stream management. SIGMOD Record, 32(2), 2003.

[9] P. Haas, J. Naughton, and A. Swami. On the relative cost of sampling for join selectivity estimation. In PODS '94: Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 14–24, New York, NY, USA, 1994. ACM Press.

[10] W. Hong and M. Stonebraker. Optimization of parallel query execution plans in XPRS. Distributed and Parallel Databases, 1993.

[11] B. Liu, Y. Zhu, M. Jbantova, B. Momberger, and E. Rundensteiner. A dynamically adaptive distributed system for processing complex continuous queries. In Very Large Data Bases (VLDB), 2005.

[12] Y. Liu and B. Plale. Query optimization for distributed data streams. In 15th International Conference on Software Engineering and Data Engineering (SEDE'06), 2006.

[13] M. Mehta. Design and implementation of an interface for the integration of DynaMIT with the traffic management center. Master's thesis, MIT, 2001.

[14] S. Milton and J. Arnold. Introduction to Probability and Statistics: Principles and Applications for Engineering and the Computing Sciences. McGraw-Hill, 4th edition, 2003.

[15] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In the ACM Intl. Conf. on Management of Data (SIGMOD), 2003.

[16] V. Poosala. Histogram-based estimation techniques in databases. PhD thesis, Univ. of Wisconsin-Madison, 1997.

[17] V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In 23rd Int. Conf. on Very Large Databases, Aug. 1997.

[18] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1996.

[19] Y. Wei, V. Prasad, S. H. Son, and J. A. Stankovic. Prediction-based QoS management for real-time data streams. In Proc. 27th IEEE Real-Time Systems Symposium (RTSS 06), Dec. 2006.

[20] Y. Wei, S. H. Son, and J. A. Stankovic. RTSTREAM: Real-time query for data streams. In 9th IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing (ISORC), Apr. 2006.

[21] A. Wilschut and P. Apers. Dataflow query execution in a parallel main-memory environment. In PDIS, 1991.