


Models and Issues in Data Stream Systems

Brian Babcock Shivnath Babu Mayur Datar Rajeev Motwani Jennifer Widom
Department of Computer Science
Stanford University
Stanford, CA 94305
{babcock, shivnath, datar, rajeev, widom}@cs.stanford.edu

ABSTRACT

In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

1. INTRODUCTION

Recently a new class of data-intensive applications has become widely recognized: applications in which the data is modeled best not as persistent relations but rather as transient data streams. Examples of such applications include financial applications, network monitoring, security, telecommunications data management, web applications, manufacturing, sensor networks, and others. In the data stream model, individual data items may be relational tuples, e.g., network measurements, call records, web page visits, sensor readings, and so on. However, their continuous arrival in multiple, rapid, time-varying, possibly unpredictable and unbounded streams appears to yield some fundamentally new research problems.

In all of the applications cited above, it is not feasible to simply load the arriving data into a traditional database management system (DBMS) and operate on it there. Traditional DBMS's are not designed for rapid and continuous loading of individual data items, and they do not directly support the continuous queries [82] that are typical of data stream applications. Furthermore, it is recognized that both approximation [30] and adaptivity [8] are key ingredients in executing queries and performing other processing (e.g., data analysis and mining) over rapid data streams, while traditional DBMS's focus largely on the opposite goal of precise answers computed by stable query plans.

In this paper we consider fundamental models and issues in developing a general-purpose Data Stream Management System (DSMS). We are developing such a system at Stanford [80], and we will touch on some of our own work in this paper. However, we also attempt to provide a general overview of the area, along with its related and current work. (Any glaring omissions are, naturally, our own fault.)

We begin in Section 2 by considering the data stream model and queries over streams. In this section we take a simple view: streams are append-only relations with transient tuples, and queries are SQL operating over these logical relations. In later sections we discuss several issues that complicate the model and query language, such as ordering, timestamping, and sliding windows. Section 2 also presents some concrete examples to ground our discussion.

In Section 3 we review recent projects geared specifically towards data stream processing, as well as a plethora of past research in areas related to data streams: active databases, continuous queries, filtering systems, view management, sequence databases, and others. Although much of this work clearly has applications to data stream processing, we hope to show in this paper that there are many new problems to address in realizing a complete DSMS.

Section 4 delves more deeply into the area of query processing, uncovering a number of important issues, including:

• Queries that require an unbounded amount of memory to evaluate precisely, and approximate query processing techniques to address this problem.

• Sliding window query processing (i.e., considering "recent" portions of the streams only), both as an approximation technique and as an option in the query language, since many applications prefer sliding-window queries.

• Batch processing, sampling, and synopsis structures to handle situations where the flow rate of the input streams may overwhelm the query processor.

• The meaning and implementation of blocking operators (e.g., aggregation and sorting) in the presence of unending streams.

• Continuous queries that are registered when portions of the data streams have already "passed by," yet the queries wish to reference stream history.

Section 5 then outlines some details of a query language and an architecture for a DSMS query processor designed specifically to address the issues above.

In Section 6 we review algorithmic results in data stream processing. Our focus is primarily on sketching techniques and building summary structures (synopses). We also touch upon sliding window computations, present some negative results, and discuss a few additional algorithmic issues.

We conclude in Section 7 with some remarks on the evolution of this new field, and a summary of directions for further work.

Work supported by NSF Grant IIS-0118173. Mayur Datar was also supported by a Microsoft Graduate Fellowship. Rajeev Motwani received partial support from an Okawa Foundation Research Grant.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
ACM PODS 2002 June 3-6, Madison, Wisconsin, USA
© 2002 ACM 1-58113-507-6/02/06...$5.00.

2. THE DATA STREAM MODEL

In the data stream model, some or all of the input data that are to be operated on are not available for random access from disk or memory, but rather arrive as one or more continuous data streams. Data streams differ from the conventional stored relation model in several ways:

• The data elements in the stream arrive online.

• The system has no control over the order in which data elements arrive to be processed, either within a data stream or across data streams.

• Data streams are potentially unbounded in size.

• Once an element from a data stream has been processed it is discarded or archived — it cannot be retrieved easily unless it is explicitly stored in memory, which typically is small relative to the size of the data streams.

Operating in the data stream model does not preclude the presence of some data in conventional stored relations. Often, data stream queries may perform joins between data streams and stored relational data. For the purposes of this paper, we will assume that if stored relations are used, their contents remain static. Thus, we preclude any potential transaction-processing issues that might arise from the presence of updates to stored relations that occur concurrently with data stream processing.

2.1 Queries

Queries over continuous data streams have much in common with queries in a traditional database management system. However, there are two important distinctions peculiar to the data stream model. The first distinction is between one-time queries and continuous queries [82]. One-time queries (a class that includes traditional DBMS queries) are queries that are evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user. Continuous queries, on the other hand, are evaluated continuously as data streams continue to arrive. Continuous queries are the more interesting class of data stream queries, and it is to them that we will devote most of our attention. The answer to a continuous query is produced over time, always reflecting the stream data seen so far. Continuous query answers may be stored and updated as new data arrives, or they may be produced as data streams themselves. Sometimes one or the other mode is preferred. For example, aggregation queries may involve frequent changes to answer tuples, dictating the stored approach, while join queries are monotonic and may produce rapid, unbounded answers, dictating the stream approach.

The second distinction is between predefined queries and ad hoc queries. A predefined query is one that is supplied to the data stream management system before any relevant data has arrived. Predefined queries are generally continuous queries, although scheduled one-time queries can also be predefined. Ad hoc queries, on the other hand, are issued online after the data streams have already begun. Ad hoc queries can be either one-time queries or continuous queries. Ad hoc queries complicate the design of a data stream management system, both because they are not known in advance for the purposes of query optimization, identification of common subexpressions across queries, etc., and more importantly because the correct answer to an ad hoc query may require referencing data elements that have already arrived on the data streams (and potentially have already been discarded). Ad hoc queries are discussed in more detail in Section 4.6.

2.2 Motivating Examples

Examples motivating a data stream system can be found in many application domains including finance, web applications, security, networking, and sensor monitoring.

• Traderbot [83] is a web-based financial search engine that evaluates queries over real-time streaming financial data such as stock tickers and news feeds. The Traderbot web site [83] gives some examples of one-time and continuous queries that are commonly posed by its customers.

• Modern security applications often apply sophisticated rules over network packet streams. For example, iPolicy Networks [52] provides an integrated security platform providing services such as firewall support and intrusion detection over multi-gigabit network packet streams. Such a platform needs to perform complex stream processing including URL-filtering based on table lookups, and correlation across multiple network traffic flows.

• Large web sites monitor web logs (clickstreams) online to enable applications such as personalization, performance monitoring, and load-balancing. Some web sites served by widely distributed web servers (e.g., Yahoo [93]) may need to coordinate many distributed clickstream analyses, e.g., to track heavily accessed web pages as part of their real-time performance monitoring.

• There are several emerging applications in the area of sensor monitoring [15, 58] where a large number of sensors are distributed in the physical world and generate streams of data that need to be combined, monitored, and analyzed.

The application domain that we use for more detailed examples is network traffic management, which involves monitoring network packet header information across a set of routers to obtain information on traffic flow patterns. Based on a description of Babu and Widom [10], we delve into this example in some detail to help illustrate that continuous queries arise naturally in real applications and that conventional DBMS technology does not adequately support such queries.

Consider the network traffic management system of a large network, e.g., the backbone network of an Internet Service Provider (ISP) [29]. Such systems monitor a variety of continuous data streams that may be characterized as unpredictable and arriving at a high rate, including both packet traces and network performance measurements. Typically, current traffic-management tools either rely on a special-purpose system that performs online processing of simple hand-coded continuous queries, or they just log the traffic data and perform periodic offline query processing. Conventional DBMS's are deemed inadequate to provide the kind of online continuous query processing that would be most beneficial in this domain. A data stream system that could provide effective online processing of continuous queries over data streams would allow network operators to install, modify, or remove appropriate monitoring queries to support efficient management of the ISP's network resources.

Consider the following concrete setting. Network packet traces are being collected from a number of links in the network. The focus is on two specific links: a customer link, C, which connects the network of a customer to the ISP's network, and a backbone link, B, which connects two routers within the backbone network of the ISP. Let C and B denote two streams of packet traces corresponding to these two links. We assume, for simplicity, that the traces contain just the five fields of the packet header that are listed below.

src: IP address of packet sender.

dest: IP address of packet destination.

id: Identification number given by sender so that destination can uniquely identify each packet.

len: Length of the packet.

time: Time when packet header was recorded.

Consider first the continuous query Q1, which computes load on the link B averaged over one-minute intervals, notifying the network operator when the load crosses a specified threshold t. The functions getminute and notifyoperator have the natural interpretation.

    Q1: SELECT notifyoperator(sum(len))
        FROM B
        GROUP BY getminute(time)
        HAVING sum(len) > t

While the functionality of such a query may possibly be achieved in a DBMS via the use of triggers, we are likely to prefer the use of special techniques for performance reasons. For example, consider the case where the link B has a very high throughput (e.g., if it were an optical link). In that case, we may choose to compute an approximate answer to Q1 by employing random sampling on the stream — a task outside the reach of standard trigger mechanisms.
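As a rough illustration of the sampling idea, consider the following minimal sketch; the function name, packet representation, and sampling rate p are illustrative assumptions, not a prescribed implementation. Each packet is retained with probability p, and the sampled per-minute byte count is scaled up by 1/p to obtain an unbiased estimate of the true load, which is compared against the threshold t.

    import random
    from collections import defaultdict

    def monitor_load(packets, t, p=0.01):
        # Approximate Q1: estimate the per-minute load on link B from a Bernoulli
        # sample of the packet stream, and notify when the estimate crosses t.
        # `packets` yields (time, length) pairs; p is the sampling probability.
        sampled_bytes = defaultdict(int)   # minute -> byte count of sampled packets
        notified = set()                   # minutes already reported
        for time, length in packets:
            if random.random() < p:        # keep each packet with probability p
                minute = int(time // 60)   # stand-in for getminute(time)
                sampled_bytes[minute] += length
                estimate = sampled_bytes[minute] / p   # scale up to estimate true load
                if estimate > t and minute not in notified:
                    notified.add(minute)
                    print(f"notifyoperator: minute {minute}, estimated load ~{estimate:.0f}")

The variance of the per-minute estimate, and hence the chance of spurious or missed notifications, depends on the sampling rate and on the traffic volume.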
The second query Q2 isolates flows in the backbone link and determines the amount of traffic generated by each flow. A flow is defined here as a sequence of packets grouped in time, and sent from a specific source to a specific destination.

    Q2: SELECT flowid, src, dest, sum(len) AS flowlen
        FROM (SELECT src, dest, len, time
              FROM B
              ORDER BY time)
        GROUP BY src, dest, getflowid(src, dest, time) AS flowid

Here getflowid is a user-defined function which takes the source IP address, the destination IP address, and the timestamp of a packet, and returns the identifier of the flow to which the packet belongs. We assume that the data in the view (or table expression) in the FROM clause is passed to the getflowid function in the order defined by the ORDER BY clause.

Observe that handling Q2 over stream B is particularly challenging due to the presence of GROUP BY and ORDER BY clauses, which lead to "blocking" operators in a query execution plan.

Consider now the task of determining the fraction of the backbone link's traffic that can be attributed to the customer network. This query, Q3, is an example of the kind of ad hoc continuous queries that may be registered during periods of congestion to determine whether the customer network is the likely cause.

    Q3: (SELECT count(*)
         FROM C, B
         WHERE C.src = B.src and C.dest = B.dest
           and C.id = B.id)
        /
        (SELECT count(*) FROM B)

Observe that Q3 joins streams C and B on their keys to obtain a count of the number of common packets. Since joining two streams could potentially require unbounded intermediate storage (for example if there is no bound on the delay between a packet showing up on the two links), the user may prefer to compute an approximate answer. One approximation technique would be to maintain bounded-memory synopses of the two streams (see Section 6); alternatively, one could exploit aspects of the application semantics to bound the required storage (e.g., we may know that joining tuples are very likely to occur within a bounded time window).
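As a small sketch of the second option — assuming (as the application semantics might guarantee) that a packet's two appearances on C and B occur within `window` seconds of each other, and that a (src, dest, id) key appears at most once per link within that window — a symmetric join with time-based eviction keeps only the recent portion of each stream:

    from collections import deque

    def windowed_join_count(packets, window):
        # Count packets seen on both C and B, retaining per-link state only for
        # the last `window` seconds. `packets` yields (link, time, key) with
        # link in {'C', 'B'} and key = (src, dest, id).
        seen = {'C': {}, 'B': {}}              # key -> arrival time, current window only
        arrivals = {'C': deque(), 'B': deque()}
        matches = b_count = 0
        for link, time, key in packets:
            if link == 'B':
                b_count += 1
            # Evict tuples that have fallen out of the time window on both links.
            for side in ('C', 'B'):
                q = arrivals[side]
                while q and q[0][0] < time - window:
                    old_time, old_key = q.popleft()
                    if seen[side].get(old_key) == old_time:
                        del seen[side][old_key]
            other = 'B' if link == 'C' else 'C'
            if key in seen[other]:             # probe the other link's window
                matches += 1
            seen[link][key] = time             # insert into this link's window
            arrivals[link].append((time, key))
        return matches, b_count

Dividing the returned match count by the count of B-packets yields the fraction Q3 computes; the accuracy of the answer depends entirely on whether the bounded-delay assumption actually holds.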
Our final example, Q4, is a continuous query for monitoring the source-destination pairs in the top 5 percent in terms of backbone traffic. For ease of exposition, we employ the WITH construct from SQL-99 [85].

    Q4: WITH Load AS
          (SELECT src, dest, sum(len) AS traffic
           FROM B
           GROUP BY src, dest)
        SELECT src, dest, traffic
        FROM Load AS L1
        WHERE (SELECT count(*)
               FROM Load AS L2
               WHERE L2.traffic < L1.traffic) >
              0.95 * (SELECT count(*) FROM Load)
        ORDER BY traffic

3. REVIEW OF DATA STREAM PROJECTS

We now provide an overview of several past and current projects related to data stream management. We will revisit some of these projects in later sections when we discuss the issues that we are facing in building a general-purpose data stream management system at Stanford.

Continuous queries were used in the Tapestry system [82] for content-based filtering over an append-only database of email and bulletin board messages. A restricted subset of SQL was used as the query language in order to provide guarantees about efficient evaluation and append-only query results. The Alert system [72] provides a mechanism for implementing event-condition-action style triggers in a conventional SQL database, by using continuous queries defined over special append-only active tables. The XFilter content-based filtering system [6] performs efficient filtering of XML documents based on user profiles expressed as continuous queries in the XPath language [92]. Xyleme [67] is a similar content-based filtering system that enables very high throughput with a restricted query language. The Tribeca stream database manager [81] provides restricted querying capability over network packet streams.

The OpenCQ [57] and NiagaraCQ [23] systems support continuous queries for monitoring persistent data sets spread over a wide-area network, e.g., web sites over the Internet. OpenCQ uses a query processing algorithm based on incremental view maintenance, while NiagaraCQ addresses scalability in number of queries by proposing techniques for grouping continuous queries for efficient evaluation. Within the NiagaraCQ project, Shanmugasundaram et al. [77] discuss the problem of supporting blocking operators in query plans over data streams, and Viglas and Naughton [87] propose rate-based optimization for queries over data streams, a new optimization methodology that is based on stream-arrival and data-processing rates.
The Chronicle data model [55] introduced append-only ordered sequences of tuples (chronicles), a form of data streams. They defined a restricted view definition language and algebra (chronicle algebra) that operates over chronicles together with traditional relations. The focus of the work was to ensure that views defined in chronicle algebra could be maintained incrementally without storing any of the chronicles. An algebra and a declarative query language for querying ordered relations (sequences) was proposed by Seshadri, Livny, and Ramakrishnan [74, 75, 76]. In many applications, continuous queries need to refer to the sequencing aspect of streams, particularly in the form of sliding windows over streams. Related work in this category also includes work on temporal [78] and time-series databases [31], where the ordering of tuples implied by time can be used in querying, indexing, and query optimization.

The body of work on materialized views relates to continuous queries, since materialized views are effectively queries that need to be reevaluated or incrementally updated whenever the base data changes. Of particular importance is work on self-maintenance [14, 45, 69] — ensuring that enough data has been saved to maintain a view even when the base data is unavailable — and the related problem of data expiration [36] — determining when certain base data can be discarded without compromising the ability to maintain a view. Nevertheless, several differences exist between materialized views and continuous queries in the data stream context: continuous queries may stream rather than store their results, they may deal with append-only input data, they may provide approximate rather than exact answers, and their processing strategy may adapt as characteristics of the data streams change.

The Telegraph project [8, 47, 58, 59] shares some target applications and basic technical ideas with a DSMS. Telegraph uses an adaptive query engine (based on the Eddy concept [8]) to process queries efficiently in volatile and unpredictable environments (e.g., autonomous data sources over the Internet, or sensor networks). Madden and Franklin [58] focus on query execution strategies over data streams generated by sensors, and Madden et al. [59] discuss adaptive processing techniques for multiple continuous queries. The Tukwila system [53] also supports adaptive query processing, in order to perform dynamic data integration over autonomous data sources.

The Aurora project [15] is building a new data processing system targeted exclusively towards stream monitoring applications. The core of the Aurora system consists of a large network of triggers. Each trigger is a data-flow graph with each node being one among seven built-in operators (or boxes in Aurora's terminology). For each stream monitoring application using the Aurora system, an application administrator creates and adds one or more triggers into Aurora's trigger network. Aurora performs both compile-time optimization (e.g., reordering operators, shared state for common subexpressions) and run-time optimization of the trigger network. As part of run-time optimization, Aurora detects resource overload and performs load shedding based on application-specific measures of quality of service.

4. QUERIES OVER DATA STREAMS

Query processing in the data stream model of computation comes with its own unique challenges. In this section, we will outline what we consider to be the most interesting of these challenges, and describe several alternative approaches for resolving them. The issues raised in this section will frame the discussion in the rest of the paper.

4.1 Unbounded Memory Requirements

Since data streams are potentially unbounded in size, the amount of storage required to compute an exact answer to a data stream query may also grow without bound. While external memory algorithms [89] for handling data sets larger than main memory have been studied, such algorithms are not well suited to data stream applications since they do not support continuous queries and are typically too slow for real-time response. The continuous data stream model is most applicable to problems where timely query responses are important and there are large volumes of data that are being continually produced at a high rate over time. New data is constantly arriving even as the old data is being processed; the amount of computation time per data element must be low, or else the latency of the computation will be too high and the algorithm will not be able to keep pace with the data stream. For this reason, we are interested in algorithms that are able to confine themselves to main memory without accessing disk.

Arasu et al. [7] took some initial steps towards distinguishing between queries that can be answered exactly using a given bounded amount of memory and queries that must be approximated unless disk accesses are allowed. They consider a limited class of queries and, for that class, provide a complete characterization of the queries that require a potentially unbounded amount of memory (proportional to the size of the input data streams) to answer. Their result shows that without knowing the size of the input data streams, it is impossible to place a limit on the memory requirements for most common queries involving joins, unless the domains of the attributes involved in the query are restricted (either based on known characteristics of the data or through the imposition of query predicates). The basic intuition is that without domain restrictions an unbounded number of attribute values must be remembered, because they might turn out to join with tuples that arrive in the future. Extending these results to full generality remains an open research problem.

4.2 Approximate Query Answering

As described in the previous section, when we are limited to a bounded amount of memory it is not always possible to produce exact answers for data stream queries; however, high-quality approximate answers are often acceptable in lieu of exact answers. Approximation algorithms for problems defined over data streams have been a fruitful research area in the algorithms community in recent years, as discussed in detail in Section 6. This work has led to some general techniques for data reduction and synopsis construction, including: sketches [5, 35], random sampling [1, 2, 21], histograms [51, 68], and wavelets [16, 90]. Based on these summarization techniques, we have seen some work on approximate query answering. For example, recent work [26, 37] develops histogram-based techniques to provide approximate answers for correlated aggregate queries over data streams, and Gilbert et al. [40] present a general approach for building small-space summaries over data streams to provide approximate answers for many classes of aggregate queries. However, research problems abound in the area of approximate query answering, with or without streams. Even the basic notion of approximations remains to be investigated in detail for queries involving more than simple aggregation. In the next two subsections, we will touch upon several approaches to approximation, some of which are peculiar to the data stream model of computation.
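Of the synopsis techniques listed above, random sampling is the simplest to state precisely. As a minimal sketch (ours, purely illustrative), the classic reservoir scheme maintains a uniform k-element sample of an unbounded stream in O(k) memory, from which aggregates such as averages or selectivities can then be estimated at the cost of sampling error:

    import random

    def reservoir_sample(stream, k):
        # Maintain a uniform random sample of k elements over an unbounded
        # stream using O(k) memory (the classic reservoir scheme).
        reservoir = []
        for n, item in enumerate(stream, start=1):
            if n <= k:
                reservoir.append(item)     # fill the reservoir with the first k items
            else:
                j = random.randint(1, n)   # item replaces a reservoir slot w.p. k/n
                if j <= k:
                    reservoir[j - 1] = item
        return reservoir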
4.3 Sliding Windows

One technique for producing an approximate answer to a data stream query is to evaluate the query not over the entire past history of the data streams, but rather only over sliding windows of recent data from the streams. For example, only data from the last week could be considered in producing query answers, with data older than one week being discarded.

Imposing sliding windows on data streams is a natural method for approximation that has several attractive properties. It is well-defined and easily understood: the semantics of the approximation are clear, so that users of the system can be confident that they understand what is given up in producing the approximate answer. It is deterministic, so there is no danger that unfortunate random choices will produce a bad approximation. Most importantly, it emphasizes recent data, which in the majority of real-world applications is more important and relevant than old data: if one is trying in real-time to make sense of network traffic patterns, or phone call or transaction records, or scientific sensor data, then in general insights based on the recent past will be more informative and useful than insights based on stale data. In fact, for many such applications, sliding windows can be thought of not as an approximation technique reluctantly imposed due to the infeasibility of computing over all historical data, but rather as part of the desired query semantics explicitly expressed as part of the user's query. For example, queries Q3 and Q4 from Section 2.2, which tracked traffic on the network backbone, would likely be applied not to all traffic over all time, but rather to traffic in the recent past.
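As a small illustration of window semantics (a sketch of ours; the class name and time units are assumptions), the following structure maintains the average of the values that arrived in the last `window` time units by buffering the window's contents and expiring old elements as new ones arrive:

    from collections import deque

    class SlidingWindowAverage:
        # Average of the values that arrived in the last `window` time units,
        # maintained incrementally; memory is proportional to the window contents.
        def __init__(self, window):
            self.window = window
            self.items = deque()    # (timestamp, value) pairs currently in the window
            self.total = 0.0

        def insert(self, timestamp, value):
            self.items.append((timestamp, value))
            self.total += value
            # Expire elements older than the window (assumes in-order timestamps).
            while self.items and self.items[0][0] <= timestamp - self.window:
                _, old = self.items.popleft()
                self.total -= old

        def average(self):
            return self.total / len(self.items) if self.items else None

Note that this naive version stores the entire window; the theoretical work cited below addresses precisely the case where the window itself is too large to buffer in memory.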
There are a variety of research issues in the use of sliding windows over data streams. To begin with, as we will discuss in Section 5.1, there is the fundamental issue of how we define timestamps over the streams to facilitate the use of windows. Extending SQL or relational algebra to incorporate explicit window specifications is nontrivial and we also touch upon this topic in Section 5.1. The implementation of sliding window queries and their impact on query optimization is a largely untouched area. In the case where the sliding window is large enough so that the entire contents of the window cannot be buffered in memory, there are also theoretical challenges in designing algorithms that can give approximate answers using only the available memory. Some recent results in this vein can be found in [9, 25].

While existing work on sequence and temporal databases has addressed many of the issues involved in time-sensitive queries (a class that includes sliding window queries) in a relational database context [74, 75, 76, 78], differences in the data stream computation model pose new challenges. Research in temporal databases [78] is concerned primarily with maintaining a full history of each data value over time, while in a data stream system we are concerned primarily with processing new data elements on-the-fly. Sequence databases [74, 75, 76] attempt to produce query plans that allow for stream access, meaning that a single scan of the input data is sufficient to evaluate the plan and the amount of memory required for plan evaluation is a constant, independent of the data. This model assumes that the database system has control over which sequence to process tuples from next, e.g., when merging multiple sequences, which we cannot assume in a data stream system.

4.4 Batch Processing, Sampling, and Synopses

Another class of techniques for producing approximate answers is to give up on processing every data element as it arrives, resorting to some sort of sampling or batch processing technique to speed up query execution. We describe a general framework for these techniques. Suppose that a data stream query is answered using a data structure that can be maintained incrementally. The most general description of such a data structure is that it supports two operations, update(tuple) and computeAnswer(). The update operation is invoked to update the data structure as each new data element arrives, and the computeAnswer method produces new or updated results to the query. When processing continuous queries, the best scenario is that both operations are fast relative to the arrival rate of elements in the data streams. In this case, no special techniques are necessary to keep up with the data stream and produce timely answers: as each data element arrives, it is used to update the data structure, and then new results are computed from the data structure, all in less than the average inter-arrival time of the data elements. If one or both of the data structure operations are slow, however, then producing an exact answer that is continually up to date is not possible. We consider the two possible bottlenecks and approaches for dealing with them.
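A minimal skeleton of this two-operation framework might look as follows; this is a sketch of ours, with invented names, and the running SUM and the batching driver are merely illustrative:

    class RunningSum:
        # A trivial instance of the update()/computeAnswer() framework where both
        # operations are fast: an incrementally maintained SUM over the stream.
        def __init__(self):
            self.total = 0

        def update(self, element):         # invoked once per arriving data element
            self.total += element

        def compute_answer(self):          # produces the current query result on demand
            return self.total

    def run(stream, summary, batch_size=1):
        # Drive the two operations; batch_size = 1 refreshes the answer after every
        # element, while a larger batch size corresponds to the batch processing
        # approach described next (answers computed periodically, not continuously).
        for i, element in enumerate(stream, start=1):
            summary.update(element)
            if i % batch_size == 0:
                yield summary.compute_answer()

For example, list(run([3, 1, 4], RunningSum())) yields [3, 4, 8]; increasing batch_size trades timeliness of the answer for less frequent recomputation.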
Batch Processing

The first scenario is that the update operation is fast but the computeAnswer operation is slow. In this case, the natural solution is to process the data in batches. Rather than producing a continually up-to-date answer, the data elements are buffered as they arrive, and the answer to the query is computed periodically as time permits. The query answer may be considered approximate in the sense that it is not timely, i.e., it represents the exact answer at a point in the recent past rather than the exact answer at the present moment. This approach of approximation through batch processing is attractive because it does not cause any uncertainty about the accuracy of the answer, sacrificing timeliness instead. It is also a good approach when data streams are bursty. An algorithm that cannot keep up with the peak data stream rate may be able to handle the average stream rate quite comfortably by buffering the streams when their rate is high and catching up during the slow periods. This is the approach used in the XJoin algorithm [86].

Sampling

In the second scenario, computeAnswer may be fast, but the update operation is slow — it takes longer than the average inter-arrival time of the data elements. It is futile to attempt to make use of all the data when computing an answer, because data arrives faster than it can be processed. Instead, some tuples must be skipped altogether, so that the query is evaluated over a sample of the data stream rather than over the entire data stream. We obtain an approximate answer, but in some cases one can give confidence bounds on the degree of error introduced by the sampling process [48]. Unfortunately, for many situations (including most queries involving joins [19, 21]), sampling-based approaches cannot give reliable approximation guarantees. Designing sampling-based algorithms that can produce approximate answers that are provably close to the exact answer is an important and active area of research.

Synopsis Data Structures

Quite obviously, data structures where both the update and the computeAnswer operations are fast are most desirable. For classes of data stream queries where no exact data structure with the desired properties exists, one can often design an approximate data structure that maintains a small synopsis or sketch of the data rather than an exact representation, and therefore is able to keep computation per data element to a minimum. Performing data reduction through synopsis data structures as an alternative to batch processing or sampling is a fruitful research area with particular relevance to the data stream computation model. Synopsis data structures are discussed in more detail in Section 6.
4.5 Blocking Operators

A blocking query operator is a query operator that is unable to produce the first tuple of its output until it has seen its entire input. Sorting is an example of a blocking operator, as are aggregation operators such as SUM, COUNT, MIN, MAX, and AVG. If one thinks about evaluating continuous stream queries using a traditional tree of query operators, where data streams enter at the leaves and final query answers are produced at the root, then the incorporation of blocking operators into the query tree poses problems. Since continuous data streams may be infinite, a blocking operator that has a data stream as one of its inputs will never see its entire input, and therefore it will never be able to produce any output. Clearly, blocking operators are not very suitable to the data stream computation model, but aggregate queries are extremely common, and sorted data is easier to work with and can often be processed more efficiently than unsorted data. Doing away with blocking operators altogether would be problematic, but dealing with them effectively is one of the more challenging aspects of data stream computation.

Blocking operators that are the root of a tree of query operators are more tractable than blocking operators that are interior nodes in the tree, producing intermediate results that are fed to other operators for further processing (for example, the "sort" phase of a sort-merge join, or an aggregate used in a subquery). When we have a blocking aggregation operator at the root of a query tree, if the operator produces a single value or a small number of values, then updates to the answer can be streamed out as they are produced. When the answer is larger, however, such as when the query answer is a relation that is to be produced in sorted order, it is more practical to maintain a data structure with the up-to-date answer, since continually retransmitting the entire answer would be cumbersome. Neither of these two approaches works well for blocking operators that produce intermediate results, however. The central problem is that the results produced by blocking operators may continue to change over time until all the data has been seen, so operators that are consuming those results cannot make reliable decisions based on the results at an intermediate stage of query execution.

One approach to handling blocking operators as interior nodes in a query tree is to replace them with non-blocking analogs that perform approximately the same task. An example of this approach is the juggle operator [70], which is a non-blocking version of sort: it aims to locally reorder a data stream so that tuples that come earlier in the desired sort order are produced before tuples that come later in the sort order, although some tuples may be delivered out of order. An interesting open problem is how to extend this work to other types of blocking operators, as well as to quantify the error that is introduced by approximating blocking operators with non-blocking ones.

Tucker et al. [84] have proposed a different approach to blocking operators. They suggest augmenting data streams with assertions about what can and cannot appear in the remainder of the data stream. These assertions, which are called punctuations, are interleaved with the data elements in the streams. An example of the type of punctuation one might see in a stream with an attribute called daynumber is "for all future tuples, daynumber ≥ 10." Upon seeing this punctuation, an aggregation operator that was grouping by daynumber could stream out its answers for all daynumbers less than 10. Similarly, a join operator could discard all its saved state relating to previously-seen tuples in the joining stream with daynumber < 10, reducing its memory consumption.
An interesting open problem is to formalize the relationship between punctuation and the memory requirements of a query — e.g., a query that might otherwise require unbounded memory could be proved to be answerable in bounded memory if guarantees about the presence of appropriate punctuation are provided. Closely related is the idea of schema-level assertions (constraints) on data streams, which also may help with blocking operators and other aspects of data stream processing. For example, we may know that daynumbers are clustered or strictly increasing, or when joining two streams we may know that a kind of "referential integrity" exists in the arrival of join attribute values. In both cases we may use these constraints to "unblock" operators or reduce memory requirements.

4.6 Queries Referencing Past Data

In the data stream model of computation, once a data element has been streamed by, it cannot be revisited. This limitation means that ad hoc queries that are issued after some data has already been discarded may be impossible to answer accurately. One simple solution to this problem is to stipulate that ad hoc queries are only allowed to reference future data: they are evaluated as though the data streams began at the point when the query was issued, and any past stream elements are ignored (for the purposes of that query). While this solution may not appear very satisfying, it may turn out to be perfectly acceptable for many applications.

A more ambitious approach to handling ad hoc queries that reference past data is to maintain summaries of data streams (in the form of general-purpose synopses or aggregates) that can be used to give approximate answers to future ad hoc queries. Taking this approach requires making a decision in advance about the best way to use memory resources to give good approximate answers to a broad range of possible future queries. The problem is similar in some ways to problems in physical database design such as selection of indexes and materialized views [22]. However, there is an important difference: in a traditional database system, when an index or view is lacking, it is possible to go to the underlying relation, albeit at an increased cost. In the data stream model of computation, if the appropriate summary structure is not present, then no further recourse is available.

Even if ad hoc queries are declared only to pertain to future data, there are still research issues involved in how best to process them. In data stream applications, where most queries are long-lived continuous queries rather than ephemeral one-time queries, the gains that can be achieved by multi-query optimization can be significantly greater than what is possible in traditional database systems. The presence of ad hoc queries transforms the problem of finding the best query plan for a set of queries from an offline problem to an online problem. Ad hoc queries also raise the issue of adaptivity in query plans. The Eddy query execution framework [8] introduces the notion of flexible query plans that adapt to changes in data arrival rates or other data characteristics over time. Extending this idea to adapt the joint plan for a set of continuous queries as new queries are added and old ones are removed remains an open research area.

5. PROPOSAL FOR A DSMS

At Stanford we have begun the design and prototype implementation of a comprehensive DSMS called STREAM (for STanford StREam DatA Manager) [80]. As discussed in earlier sections, in a DSMS traditional one-time queries are replaced or augmented with continuous queries, and techniques such as sliding windows, synopsis structures, approximate answers, and adaptive query processing become fundamental features of the system. Other aspects of a complete DBMS also need to be reconsidered, including query languages, storage and buffer management, user and application interfaces, and transaction support. In this section we will focus primarily on the query language and query processing components of a DSMS and only touch upon other issues based on our initial experiences.

5.1 Query Language for a DSMS
Any general-purpose data management system must have a flexible and intuitive method by which the users of the system can express their queries. In the STREAM project, we have chosen to use a modified version of SQL as the query interface to the system (although we are also providing a means to submit query plans directly). SQL is a well-known language with a large user population. It is also a declarative language that gives the system flexibility in selecting the optimal evaluation procedure to produce the desired answer. Other methods for receiving queries from users are possible; for example, the Aurora system described in [15] uses a graphical "boxes and arrows" interface for specifying data flow through the system. This interface is intuitive and gives the user more control over the exact series of steps by which the query answer is obtained than is provided by a declarative query language.

The main modification that we have made to standard SQL, in addition to allowing the FROM clause to refer to streams as well as relations, is to extend the expressiveness of the query language for sliding windows. It is possible to formulate sliding window queries in SQL by referring to timestamps explicitly, but it is often quite awkward. SQL-99 [13, 79] introduces analytical functions that partially address the shortcomings of SQL for expressing sliding window queries by allowing the specification of moving averages and other aggregation operations over sliding windows. However, the SQL-99 syntax is not sufficiently expressive for data stream queries since it cannot be applied to non-aggregation operations such as joins.

The notion of sliding windows requires at least an ordering on data stream elements. In many cases, the arrival order of the elements suffices as an "implicit timestamp" attached to each data element; however, sometimes it is preferable to use "explicit timestamps" provided as part of the data stream. Formally we say (following [15]) that a data stream S consists of a set of (tuple, timestamp) pairs: S = {(s1, i1), (s2, i2), ..., (sn, in)}. The timestamp attribute could be a traditional timestamp or it could be a sequence number — all that is required is that it comes from a totally ordered domain with a distance metric. The ordering induced by the timestamps is used when selecting the data elements making up a sliding window.

We extend SQL by allowing an optional window specification to be provided, enclosed in brackets, after a stream (or subquery producing a stream) that is supplied in a query's FROM clause. A window specification consists of:

1. an optional partitioning clause, which partitions the data into several groups and maintains a separate window for each group,

2. a window size, either in "physical" units (i.e., the number of data elements in the window) or in "logical" units (i.e., the range of time covered by a window, such as 30 days), and

3. an optional filtering predicate.

As in SQL-99, physical windows are specified using the ROWS keyword (e.g., ROWS 50 PRECEDING), while logical windows are specified via the RANGE keyword (e.g., RANGE 15 MINUTES PRECEDING). In lieu of a formal grammar, we present several examples to illustrate our language extension.

The underlying source of data for our examples will be a stream of telephone call records, each with four attributes: customer_id, type, minutes, and timestamp. The timestamp attribute is the ordering attribute for the records. Suppose a user wanted to compute the average call length, but considering only the ten most recent long-distance calls placed by each customer. The query can be formulated as follows:

    SELECT AVG(S.minutes)
    FROM Calls S [PARTITION BY S.customer_id
                  ROWS 10 PRECEDING
                  WHERE S.type = 'Long Distance']

where the expression in brackets defines a sliding window on the stream of calls.

Contrast the previous query to a similar one that computes the average call length considering only long-distance calls that are among the last 10 calls of all types placed by each customer:

    SELECT AVG(S.minutes)
    FROM Calls S [PARTITION BY S.customer_id
                  ROWS 10 PRECEDING]
    WHERE S.type = 'Long Distance'

The distinction between filtering predicates applied before calculating the sliding window cutoffs and predicates applied after windowing motivates our inclusion of an optional WHERE clause within the window specification.

Here is a slightly more complicated example returning the average length of the last 1000 telephone calls placed by "Gold" customers:

    SELECT AVG(V.minutes)
    FROM (SELECT S.minutes
          FROM Calls S, Customers T
          WHERE S.customer_id = T.customer_id
          AND T.tier = 'Gold')
         V [ROWS 1000 PRECEDING]

Notice that in this example, the stream of calls must be joined to the Customers relation before applying the sliding window.

5.2 Timestamps in Streams

In the previous section, sliding windows are defined with respect to a timestamp or sequence number attribute representing a tuple's arrival time. This approach is unambiguous for tuples that come from a single stream, but it is less clear what is meant when attempting to apply sliding windows to composite tuples that are derived from tuples from multiple underlying streams (e.g., windows on the output of a join operator). What should the timestamp of a tuple in the join result be when the timestamps of the tuples that were joined to form the result tuple are different? Timestamp issues also arise when a set of distributed streams make up a single logical stream, as in the web monitoring application described in Section 2.2, or in truly distributed streams such as sensor networks when comparing timestamps across stream elements may be relevant.

In the previous section we introduced implicit timestamps, in which the system adds a special field to each incoming tuple, and explicit timestamps, in which a data attribute is designated as the timestamp. Explicit timestamps are used when each tuple corresponds to a real-world event at a particular time that is of importance to the meaning of the tuple. Implicit timestamps are used when the data source does not already include timestamp information, or when the exact moment in time associated with a tuple is not important, but general considerations such as "recent" or "old" may be important. The distinction between implicit and explicit timestamps is similar to that between transaction and valid time in the temporal database literature [78].

Explicit timestamps have the drawback that tuples may not arrive in the same order as their timestamps — tuples with later timestamps may come before tuples with earlier timestamps. This lack of guaranteed ordering makes it difficult to perform sliding window computations that are defined in relation to explicit timestamps, or any other processing based on order. However, as long as an input stream is "almost-sorted" by timestamp, except for local perturbations, then out-of-order tuples can easily be corrected with little buffering. It seems reasonable to assume that even when explicit timestamps are used, tuples will be delivered in roughly increasing timestamp order.

Let us now look at how to assign appropriate timestamps to tuples output by binary operators, using join as an example. There are several possible approaches that could be taken; we discuss two. The first approach, which fits better with implicit timestamps, is to provide no guarantees about the output order of tuples from a join operator, but to simply assume that tuples that arrive earlier are likely to pass through the join earlier; exact ordering may depend on implementation and scheduling vagaries. Each tuple that is produced by a join operator is assigned an implicit timestamp that is set to the time that it was produced by the join operator. This "best-effort" approach has the advantage that it maximizes implementation flexibility; it has the disadvantage that it makes it impossible to impose precisely-defined, deterministic sliding-window semantics on the results of subqueries.

The second approach, which fits with either explicit or implicit timestamps, is to have the user specify as part of the query what timestamp is to be assigned to tuples resulting from the join of multiple streams. One simple policy is that the order in which the streams are listed in the FROM clause of the query represents a prioritization of the streams. The timestamp for a tuple output by a join should be the timestamp of the joining tuple from the input stream listed first in the FROM clause. This approach can result in multiple tuples with the same timestamp; for the purpose of ordering the results, ties can be broken using the timestamp of the other input stream. For example, if the query is

    SELECT *
    FROM S1 [ROWS 1000 PRECEDING],
         S2 [ROWS 100 PRECEDING]
    WHERE S1.A = S2.B

then the output tuples would first be sorted by the timestamp of S1, and then ties would be broken according to the timestamp of S2.
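In other words, under this policy the effective ordering key of a join result is the pair (S1's timestamp, S2's timestamp) compared lexicographically. A tiny sketch of ours, with invented names and example values, makes the ordering explicit:

    def join_timestamp(ts_s1, ts_s2):
        # Policy sketched above for a join of S1 and S2 listed in that order in the
        # FROM clause: the output carries S1's timestamp, with S2's timestamp kept
        # only to break ties when ordering the results.
        return (ts_s1, ts_s2)     # compared lexicographically

    results = [("a", 5, 9), ("b", 5, 2), ("c", 3, 7)]          # (payload, S1 ts, S2 ts)
    results.sort(key=lambda r: join_timestamp(r[1], r[2]))
    # -> [("c", 3, 7), ("b", 5, 2), ("a", 5, 9)]: ordered by S1's timestamp, ties by S2's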
The second, stricter approach to assigning timestamps to the results of binary operators can have a drawback from an implementation point of view. If it is desirable for the output from a join to be sorted by timestamp, the join operator will need to buffer output tuples until it can be certain that future input tuples will not disrupt the ordering of output tuples. For example, if S1's timestamp has priority over S2's and a recent S1 tuple joins with an S2 tuple, it is possible that a future S2 tuple will join with an older S1 tuple that still falls within the current window. In that case, the join tuple that was produced second belongs before the join tuple that was produced first. In a query tree consisting of multiple joins, the extra latency introduced for this reason could propagate up the tree in an additive fashion. If the inputs to the join operator did not have sliding windows at all, then the join operator could never confidently produce outputs in sorted order.

As mentioned earlier, sliding windows have two distinct purposes: sometimes they are an important part of the query semantics, and other times they are an approximation scheme to improve query efficiency and reduce data volumes to a manageable size. When the sliding window serves mostly to increase query processing efficiency, then the best-effort approach, which allows wide latitude over the ordering of tuples, is usually acceptable. On the other hand, when the ordering of tuples plays a significant role in the meaning of the query, such as for query-defined sliding windows, then the stricter approach may be preferred, even at the cost of less efficient implementation. A general-purpose data stream processing system should support both types of sliding windows, and the query language should allow users to specify one or the other.

In our system, we add an extra keyword, RECENT, that replaces PRECEDING when a "best-effort" ordering may be used. For example, the clause ROWS 10 PRECEDING specifies a window consisting of the previous 10 tuples, strictly sorted by timestamp order. By comparison, ROWS 10 RECENT also specifies a sliding window consisting of 10 records, but the DSMS is allowed to use its own ordering to produce the sliding window, rather than being constrained to follow the timestamp ordering. The RECENT keyword is only used with "physical" window sizes specified as a number of records; "logical" windows such as RANGE 3 DAYS PRECEDING must use the PRECEDING keyword.

5.3 Query Processing Architecture of a DSMS

In this section, we describe the query processing architecture of our DSMS. So far we have focused on continuous queries only. When a query is registered, a query execution plan is produced that begins executing and continues indefinitely. We have not yet addressed ad hoc queries registered after relevant streams have begun (Section 4.6).

Query execution plans in our system consist of operators connected by queues. Operators can maintain intermediate state in synopsis data structures. A portion of an example query plan is shown in Figure 1, with one binary operator (Op1) and one unary operator (Op2). The two operators are connected by a queue Q3, and operator Op1 has two input queues, Q1 and Q2. Also shown in Figure 1 are two synopsis structures used by operator Op1, Syn1 and Syn2, one per input. For example, Op1 could be a sliding window join operator, which maintains a sliding window synopsis for each join input (Section 4.3). The system memory is partitioned dynamically among the synopses and queues in query plans, along with the buffers used for handling streams coming over the network and a cache for disk-resident data. Note that both Aurora [15] and Eddies [8] use a single globally-shared queue for inter-operator data flow instead of separate queues between operators as in Figure 1.

[Figure 1: A portion of a query plan in our DSMS.]
Operators in our system are scheduled for execution by a central scheduler. During execution, an operator reads data from its input queues, updates the synopsis structures that it maintains, and writes results to its output queues. (Our operators thus adhere to the update and computeAnswer model discussed in Section 4.4.) The period of execution of an operator is determined dynamically by the scheduler and the operator returns control back to the scheduler once its period expires. We are experimenting with different policies for scheduling operators and for determining the period of execution. The period of execution may be based on time, or it may be based on other quantities, such as the number of tuples consumed or produced. Both Aurora and Eddies have chosen to perform fine-grained scheduling where, in each step, the scheduler chooses a tuple from the global queue and passes it to an operator for processing, an approach that our scheduler could choose if appropriate.
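The operator/queue/scheduler structure described above might be sketched roughly as follows; this is our sketch, not the STREAM design: the class names, the tuple-budget notion of a "period," and the round-robin policy are illustrative assumptions.

    from collections import deque

    class Operator:
        # Minimal operator shell: read from input queues, apply operator-specific
        # logic, write to the output queue, and return control after a bounded
        # amount of work (its "period of execution").
        def __init__(self, inputs, output=None):
            self.inputs = inputs          # list of input deques
            self.output = output          # output deque, or None at the plan root

        def process(self, tup):           # operator-specific logic; identity by default
            yield tup

        def run(self, budget):
            done = 0
            while done < budget:
                ready = [q for q in self.inputs if q]
                if not ready:
                    break                 # nothing to do; yield back to the scheduler
                for q in ready:
                    tup = q.popleft()
                    for out in self.process(tup):
                        if self.output is not None:
                            self.output.append(out)
                    done += 1
            return done

    def schedule(operators, budget=100):
        # Central round-robin scheduler: give each operator a period of execution
        # (here a tuple budget) and repeat until no operator has work left.
        while True:
            work = sum(op.run(budget) for op in operators)
            if work == 0:
                break

A real scheduler would choose periods and operator order adaptively, which is exactly what the experiments mentioned above explore.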
We expect continuous queries and the data streams on which they operate to be long-running. During the lifetime of a continuous query parameters such as stream data characteristics, stream flow rates, and the number of concurrently running queries may vary considerably. To handle these fluctuations, all of our operators are adaptive. So far we have focused primarily on adaptivity to available memory, although other factors could be considered, including using disk to increase temporary storage at the expense of latency.

Our approach to memory adaptivity is basically one of trading accuracy for memory. Specifically, each operator maximizes the accuracy of its output based on the size of its available memory, and handles dynamic changes in the size of its available memory gracefully, since at run-time memory may be taken away from one operator and given to another. As a simple example, a sliding window join operator as discussed above may be used as an approximation to a join over the entire history of input streams. If so, the larger the windows (stored in available memory), the better the approximation. Other examples include duplicate elimination using limited-size hash tables, and sampling using reservoirs [88]. The Aurora system [15] also proposes adaptivity and approximations, and uses load-shedding techniques based on application-specified measures of quality of service for graceful degradation in the face of system overload.

Our fundamental approach of trading accuracy for memory brings up some interesting problems:

• We first need to understand how different query operators can produce approximate answers under limited memory, and how approximate results behave when operators are composed in query plans.

• Given a query plan as a tree of operators and a certain amount

dynamic environment in which it operates. Systems such as Aurora [15] and Hancock [24] completely eliminate declarative querying and provide only procedural mechanisms for querying. In contrast, we will provide a declarative language for continuous queries, similar to SQL but extended with operators such as those discussed in Section 5.1, as well as a mechanism for directly submitting plans in the query algebra that underlies our language.

We are developing a comprehensive DSMS interface that allows users and administrators to visually monitor the execution of continuous queries, including memory usage and approximation behavior. We will also provide a way for administrators to adjust system parameters as queries are running, including memory allocation and scheduling policies.

6. ALGORITHMIC ISSUES

The algorithms community has been fairly active of late in the area of data streams, typically motivated by problems in databases and networking. The model of computation underlying the algorithmic work is similar to that in Section 2 and can be formally stated as follows: A data stream algorithm takes as input a sequence of data items x1, x2, ..., xN, ... called the data stream, where the sequence is scanned only once in the increasing order of the indexes. The algorithm is required to maintain the value of a function f on the prefix of the stream seen so far.
Aurora system [15] also proposes adaptivity and approximations, The main complexity measure is the space used by the algorithm,
and uses load-shedding techniques based on application-specified although the time required to process each stream element is also
measures of quality of service for graceful degradation in the face relevant. In some cases, the algorithm maintains a data structure
of system overload. which can be used to compute the value of the function f on demand,
Our fundamental approach of trading accuracy for memory brings and then the time required to process each such query also becomes
up some interesting problems: of interest. Henzinger et al. [49] defined a similar model but also
allowed the algorithm to make multiple passes over the stream data,
 We first need to understand how different query operators can making the number of passes itself a complexity measure. We will
produce approximate answers under limited memory, and restrict our attention to algorithms which are allowed only one pass.
how approximate results behave when operators are com- We will measure space and time in terms of the parameter N
posed in query plans. which denotes the number of stream elements seen so far. The
 Given a query plan as a tree of operators and a certain amount
primary performance goal is to ensure that the space required by a
stream algorithm is “small.” Ideally, one would want the memory
bound to be independent of N (which is unbounded). However,
of memory, how can the DSMS allocate memory to the op-
erators to maximize the accuracy of the answer to the query
(i.e., minimize approximation)? for most interesting problems it is easy to prove a space lower
bound that precludes this possibility, thereby forcing us to settle for
 Under changing conditions, how can the DSMS reallocate bounds that are merely sublinear in N . A problem is considered to
memory among operators? be “well-solved” if one can devise an algorithm which requires at
most O(poly(log N )) space and O(poly(log N )) processing time
 Suppose we are given a query rather than a query plan. How per data element or query1. We will see that in some cases it is
does the query optimizer efficiently find the plan that, with the impossible to achieve such an algorithm, even if one is willing to
best memory allocation, minimizes approximation? Should settle for approximations.
plans be modified when conditions change? The rest of this section summarizes the state of the art for data
 Even further, since synopses could be shared among query stream algorithms, at least as relevant to databases. We will focus
plans [73], how do we optimally consider a set of queries, primarily on the problems of creating summary structures (syn-
which may be weighted by importance? opses) for a data stream, such as histograms, wavelet representa-
tion, clustering, and decision trees; in addition, we will also touch
In addition to memory management, we are faced the problem of upon known lower bounds for space and time requirements of data
scheduling multiple query plans in a DSMS. The scheduler needs stream algorithms. Most of these summary structures have been
to provide rate synchronization within operators (such as stream considered for traditional databases [30]. The challenge is to adapt
joins) and also across pipelined operators in query plans [8, 87]. some of these techniques to the data stream model.
Time-varying arrival rates of data streams and time-varying output
rates of operators add to the complexity of scheduling. Scheduling 6.1 Random Samples
decisions also need to take into account memory allocation across Random samples can be used as a summary structure in many
operators, including management of buffers for incoming streams, scenarios where a small sample is expected to capture the essential
availability of synopses on disk as opposed to in memory, and the characteristics of the data set [65]. It is perhaps the easiest form
performance requirements of individual queries. of summarization in a DSMS and other synopses can be built from
Aside from the query processing architecture, user and appli-
cation interfaces need to be reinvestigated in a DSMS given the 1
We use poly to denote a polynomial function.

9
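As an illustration of how simple such maintenance is, the following minimal sketch (ours, not Vitter's original code; the skip-based optimizations of [88] and the weighted extension of [21] are omitted) maintains a uniform size-k sample in one pass over the stream.

```python
import random

class ReservoirSample:
    """Maintain a uniform random sample of size k over a stream in one pass."""
    def __init__(self, k, rng=random.Random(42)):
        self.k, self.n, self.rng = k, 0, rng
        self.sample = []

    def add(self, item):
        self.n += 1
        if len(self.sample) < self.k:
            self.sample.append(item)        # fill the reservoir first
        else:
            j = self.rng.randrange(self.n)  # item replaces a random slot with probability k/n
            if j < self.k:
                self.sample[j] = item

# Example: sample 100 elements from a synthetic stream of one million integers.
rs = ReservoirSample(k=100)
for x in range(1_000_000):
    rs.add(x)
print(len(rs.sample))
```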
6.2 Sketching Techniques

In their seminal paper, Alon, Matias and Szegedy [5] introduced the notion of randomized sketching, which has been widely used ever since. Sketching involves building a summary of a data stream using a small amount of memory, using which it is possible to estimate the answer to certain queries (typically, "distance" queries) over the data set.

Let S = (x1, ..., xN) be a sequence of elements where each xi belongs to the domain D = {1, ..., d}. Let the multiplicity mi = |{j : xj = i}| denote the number of occurrences of value i in the sequence S. For k >= 0, the kth frequency moment Fk of S is defined as Fk = sum over i=1..d of mi^k; further, we define F-infinity = max_i mi. The frequency moments capture the statistics of the distribution of values in S — for instance, F0 is the number of distinct values in the sequence, F1 is the length of the sequence, F2 is the self-join size (also called Gini's index of homogeneity), and F-infinity is the most frequent item's multiplicity. It is not very difficult to see that an exact computation of these moments requires linear space, and so we focus our attention on approximations.

The problem of efficiently estimating the number of distinct values (F0) has received particular attention in the database literature, particularly in the context of using single pass or random sampling algorithms [17, 46]. A sketching technique to compute F0 was presented earlier by Flajolet and Martin [35]; however, this had the drawback of requiring explicit families of hash functions with very strong independence properties. This requirement was relaxed by Alon, Matias, and Szegedy [5] who presented a sketching technique to estimate F0 within a constant factor. (As discussed in Section 6.7, recently Bar-Yossef et al. [12] and Gibbons and Tirthapura [38] have devised algorithms which, under certain conditions, provide arbitrarily small approximation factors without recourse to perfect hash functions.) Their technique uses linear hash functions and requires only O(log d) memory. The key contribution of Alon et al. [5] was a sketching technique to estimate F2 that uses only O(log d + log N) space and provides arbitrarily small approximation factors. This technique has found many applications in the database literature, including join size estimation [4], estimating the L1 norm of vectors [33], and processing complex aggregate queries over multiple streams [26, 37]. It remains an open problem to come up with techniques to maintain correlated aggregates [37] that have provable guarantees.

The key idea behind the F2-sketching technique can be described as follows: Every element i in the domain D is hashed uniformly at random onto a value zi in {-1, +1}. Define the random variable X = sum over i of mi*zi and return X^2 as the estimator of F2. Observe that the estimator can be computed in a single pass over the data provided we can efficiently compute the zi values. If the hash functions have four-way independence (hash functions with four-way independence can be obtained using standard techniques involving the use of parity check matrices of BCH codes [65]), it is easy to prove that the quantity X^2 has expectation equal to F2 and variance less than 2*F2^2. Using standard tricks, we can combine several independent estimators to accurately and with high probability obtain an estimate of F2. At an intuitive level, we can view this technique as a tug-of-war where elements are randomly assigned to one of the two sides of the rope based on the value zi; the square of the displacement of the rope captures the skew F2 in the data.

Observe that computing the self-join size of a relation is exactly the same as computing F2 for the values of the join attribute in the relation. Alon et al. [4] extended this technique to estimating the join size between two distinct relations A and B, as follows. Let Y and Z be random variables corresponding to A and B, respectively, similar to the random variable X above; the mapping from domain values i to zi is the same for both relations. Then, it can be proved that the estimator Y*Z has expected value equal to |A join B| and variance less than 2*|A join A|*|B join B|. In order to get small relative error, we can use O(|A join A|*|B join B| / |A join B|^2) independent estimators. Observe that for estimating joins between two relations, the number of estimators depends on the data distribution. In a recent paper, Dobra et al. [26] extended this technique to estimate the size of multi-way joins and for answering complex aggregate queries over them. They also provide techniques to optimally partition the data domain and use estimators on each partition independently, so as to minimize the total memory requirement.

The frequency moment F2 can also be viewed as the L2 norm of a vector whose value along the ith dimension is the multiplicity mi. Thus, the above technique can be used to compute the L2 norm under an update model for vectors, where each data element (v, i) increments (or decrements) some mi by a quantity v. On seeing such an update, we update the corresponding sketch by adding v*zi to the sum. The sketching idea can also be extended to compute the L1 norm of a vector, as follows. Let us assume that each dimension of the underlying vector is an integer, bounded by M. Consider the unary representation of the vector. It has M*d bit positions (elements), where d is the dimension of the underlying vector. A 1 in the unary representation denotes that the element corresponding to the bit position is present once; otherwise, it is not present. Then F2 captures the L1 norm of the vector. The catch is that, given an element ri along dimension i, it is required that we can efficiently compute the range sum (over j = 0 to ri - 1) of the hash values zi,j corresponding to the pertinent bit positions that are set to 1. Feigenbaum et al. [33] showed how to construct such a family of range-summable +1/-1-valued hash functions with limited (four-way) independence. Indyk [50] provided a uniform framework to compute the Lp norm (for p in (0, 2]) using the so-called p-stable distributions, improving upon the previous paper [33] for estimating the L1 norm, in that it allowed for arbitrary addition and deletion updates in every dimension. The ability to efficiently compute the L1 and L2 norms of the difference of two vectors is central to some synopsis structures designed for data streams.
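To make the construction concrete, here is a small illustrative sketch of the tug-of-war estimator (our code, not from [5]): the four-wise independent BCH-based hash functions are replaced by a salted built-in hash, and several independent copies are averaged; a full implementation would take a median of several such averages to obtain high-probability guarantees.

```python
import statistics

def sign(seed, i):
    # Stand-in for a four-wise independent +1/-1 hash function; a real
    # implementation would use, e.g., BCH-code-based or polynomial hashing.
    return 1 if hash((seed, i)) & 1 else -1

class F2Sketch:
    """Tug-of-war estimator for the second frequency moment F2 = sum_i m_i^2."""
    def __init__(self, copies=64):
        self.copies = copies
        self.counters = [0] * copies       # each counter holds X = sum_i m_i * z_i

    def update(self, i, v=1):
        # Process element i (or an update of +/- v to the multiplicity m_i).
        for s in range(self.copies):
            self.counters[s] += v * sign(s, i)

    def estimate(self):
        return statistics.mean(x * x for x in self.counters)

# Example: a stream over domain {0,...,9} in which value 3 is heavily skewed.
sk = F2Sketch()
for i in [3] * 1000 + list(range(10)) * 50:
    sk.update(i)
print(round(sk.estimate()))   # close to the true F2 of 1,125,000
```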
6.3 Histograms

Histograms are commonly-used summary structures to succinctly capture the distribution of values in a data set (i.e., a column, or possibly even a collection of columns, in a table). They have been employed for a multitude of tasks such as query size estimation, approximate query answering, and data mining. We consider the summarization of data streams using histograms. There are several different types of histograms that have been proposed in the literature. Some popular definitions are:

- V-Optimal Histograms: These approximate the distribution of a set of values v1, ..., vn by a piecewise-constant function vhat(i), so as to minimize the sum of squared error, sum over i of (vi - vhat(i))^2.
- Equi-Width Histograms: These partition the domain into buckets such that the number of vi values falling into each bucket is uniform across all buckets. In other words, they maintain quantiles for the underlying data distribution as the bucket boundaries.
- End-Biased Histograms: These maintain exact counts of items that occur with frequency above a threshold, and approximate the other counts by a uniform distribution. Maintaining the count of such frequent items is related to iceberg queries [32].

We give an overview of recent work on computing such histograms over data streams.

V-Optimal Histograms over Data Streams

Jagadish et al. [54] showed how to compute optimal V-Optimal Histograms for a given data set using dynamic programming. The algorithm uses O(N) space and requires O(N^2 B) time, where N is the size of the data set and B is the number of buckets. This is prohibitive for data streams. Guha, Koudas and Shim [43] adapted this algorithm to sorted data streams. Their algorithm constructs an arbitrarily-close V-Optimal Histogram (i.e., with error arbitrarily close to that of the optimal histogram), using O(B^2 log N) space and O(B^2 log N) time per data element.

In a recent paper, Gilbert et al. [39] removed the restriction that the data stream be sorted, providing algorithms based on the sketching technique described earlier for computing L2 norms. The idea is to view each data element as an update to an underlying vector of length N that we are trying to approximate using the best B-bucket histogram. The time to process a data element, the time to reconstruct the histogram, and the size of the sketch are each bounded by poly(B, log N, 1/eps), where eps is the relative error we are willing to tolerate. Their algorithm proceeds by first constructing a robust approximation to the underlying "signal." A robust approximation is built by repeatedly adding a dyadic interval of constant value (a signal that corresponds to a constant value over the dyadic interval and is zero everywhere else) which best reduces the approximation error. In order to find such a dyadic interval it is necessary to efficiently compute the sketch of the original signal minus the constant dyadic interval, that is, a sketch for the L2 norm of the difference between the original signal and the dyadic interval with constant value. This translates to efficiently computing the range sum of p-stable random variables (used for computing the L2 sketch, see Indyk [50]) over the dyadic interval. Gilbert et al. [39] show how to construct such efficiently range-summable p-stable random variables. From the robust histogram they cull a histogram of desired accuracy and with B buckets.
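For reference, the exact dynamic program of Jagadish et al. [54] on a stored data set can be sketched as follows (our illustration, using prefix sums for the per-bucket error); the streaming algorithms above approximate this computation without materializing the full input.

```python
def v_optimal_histogram(values, B):
    """Exact V-Optimal histogram error via dynamic programming:
    O(N^2 * B) time, O(N * B) space. Returns the minimum total sum of
    squared errors achievable with at most B buckets."""
    n = len(values)
    prefix = [0.0] * (n + 1)      # prefix sums of values
    prefix_sq = [0.0] * (n + 1)   # prefix sums of squared values
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def sse(i, j):
        # Squared error of approximating values[i:j] by their mean.
        s = prefix[j] - prefix[i]
        sq = prefix_sq[j] - prefix_sq[i]
        return sq - s * s / (j - i)

    INF = float("inf")
    # opt[b][j] = best error for the first j values using exactly b buckets
    opt = [[INF] * (n + 1) for _ in range(B + 1)]
    opt[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(1, n + 1):
            opt[b][j] = min(opt[b - 1][i] + sse(i, j) for i in range(b - 1, j))
    return opt[B][n]

print(v_optimal_histogram([1, 1, 2, 9, 9, 10, 10, 3], B=3))
```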
Equi-Width Histograms and Quantiles

Equi-width histograms (quantiles) are summary structures which characterize data distributions in a manner that is less sensitive to outliers. In traditional databases they are used by optimizers for selectivity estimation. Parallel database systems employ value range data partitioning that requires generation of quantiles or splitters that partition the data into approximately equal parts. Recently, Greenwald and Khanna [41] presented a single-pass deterministic algorithm for efficient computation of quantiles. Their algorithm needs O((1/eps) log(eps*N)) space and guarantees a precision of eps*N. They employ a novel data structure that maintains a sample of the values seen so far (quantiles), along with a range of possible ranks that the samples can take. The error associated with each quantile is the width of this range. They periodically merge quantiles with "similar" errors so long as the error for the combined quantile does not exceed eps*N. This algorithm improves upon the previous set of results by Manku, Rajagopalan, and Lindsay [61, 62] and Chaudhuri, Motwani, and Narasayya [20].

End-Biased Histograms and Iceberg Queries

Many applications maintain simple aggregates (count) over an attribute to find aggregate values above a specified threshold. These queries are referred to as iceberg queries [32]. Such iceberg queries arise in many applications, including data mining, data warehousing, information retrieval, market basket analysis, copy detection, and clustering. For example, a search engine might be interested in gathering search terms that account for more than 1% of the queries. Such frequent item summaries are useful for applications such as caching and analyzing trends. Fang et al. [32] gave an efficient algorithm to compute iceberg queries over disk-resident data. Their algorithm requires multiple passes and is therefore not suited to the streaming model. In a recent paper, Manku and Motwani [60] presented randomized and deterministic algorithms for frequency counting and iceberg queries over data streams. The randomized algorithm uses adaptive sampling, and the main idea is that any item which accounts for an eps fraction of the items is highly likely to be a part of a uniform sample of size 1/eps. The deterministic algorithm maintains a sample of the distinct items along with their frequencies. Whenever a new item is added, it is given the benefit of the doubt by over-estimating its frequency. If we see an item that already exists in the sample, its frequency is incremented. Periodically, items with low frequency are deleted. Their algorithms require O((1/eps) log(eps*N)) space, where N is the length of the data stream, and guarantee that any element is undercounted by at most eps*N. Thus, these algorithms report all items of count greater than eps*N. Moreover, for all items reported, they guarantee that the reported count is less than the actual count, but by no more than eps*N.
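A minimal sketch in the spirit of this deterministic scheme — essentially the lossy-counting algorithm of [60], simplified, with parameter names of our choosing — is shown below.

```python
import math

class LossyCounting:
    """Approximate frequency counts: every item with true count > eps*N is reported,
    and no reported count undercounts the truth by more than eps*N."""
    def __init__(self, eps):
        self.eps = eps
        self.width = math.ceil(1.0 / eps)   # bucket width
        self.n = 0                          # stream length so far
        self.counts = {}                    # item -> (count, maximum possible undercount)

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.counts:
            c, d = self.counts[item]
            self.counts[item] = (c + 1, d)
        else:
            # New entries get the "benefit of the doubt": they may have been seen
            # up to bucket-1 times before we started tracking them.
            self.counts[item] = (1, bucket - 1)
        if self.n % self.width == 0:        # periodically prune low-frequency entries
            self.counts = {k: (c, d) for k, (c, d) in self.counts.items()
                           if c + d > bucket}

    def frequent(self, support):
        # Report items whose tracked count is at least (support - eps) * N.
        thresh = (support - self.eps) * self.n
        return {k: c for k, (c, d) in self.counts.items() if c >= thresh}
```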
6.4 Wavelets

Wavelets are often used as a technique to provide a summary representation of the data. Wavelet coefficients are projections of the given signal (set of data values) onto an orthogonal set of basis vectors. The choice of basis vectors determines the type of wavelets. Often Haar wavelets are used in databases for their ease of computation. Wavelet coefficients have the desirable property that the signal reconstructed from the top few wavelet coefficients best approximates the original signal in terms of the L2 norm. Recent papers have demonstrated the efficacy of wavelets for different tasks such as selectivity estimation [63], data cube approximation [91], and computing multi-dimensional aggregates [90]. This body of work indicates that estimates obtained from wavelets were more accurate than those obtained from histograms with the same amount of memory. Chakrabarti et al. [16] propose the use of wavelets for general-purpose approximate query processing and demonstrate how to compute joins, aggregations, and selections entirely in the wavelet coefficient domain.

To extend this body of work to data streams, it becomes important to devise techniques for computing wavelets in the streaming model. In a related development, Matias, Vitter, and Wang [64] show how to dynamically maintain the top wavelet coefficients efficiently as the underlying data distribution is updated. There has been recent work on computing the top wavelet coefficients in the data stream model. The technique of Gilbert et al. [39], to approximate the best dyadic interval that most reduces the error, gives rise to an easy greedy algorithm to find the best B-term Haar wavelet representation. This is because the Haar wavelet basis consists of dyadic intervals with constant values. This improves upon a previous result by Gilbert et al. [40]. If the data is presented in sorted order, there is a simple algorithm that maintains the best B-term Haar wavelet representation using O(B + log N) space in a deterministic manner [40].
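For intuition, the following small offline sketch (ours) computes the orthonormal Haar decomposition of a signal whose length is a power of two and keeps the B largest-magnitude coefficients; the streaming techniques above maintain comparable summaries without storing the whole signal.

```python
import heapq
import math

def haar_top_b(signal, B):
    """Return the B largest-magnitude orthonormal Haar coefficients of `signal`
    (length must be a power of two) as (index, value) pairs."""
    n = len(signal)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    coeffs = []
    s = list(signal)
    while len(s) > 1:
        nxt, det = [], []
        for i in range(0, len(s), 2):
            nxt.append((s[i] + s[i + 1]) / math.sqrt(2))   # orthonormal "average"
            det.append((s[i] - s[i + 1]) / math.sqrt(2))   # orthonormal "detail"
        coeffs.extend(det)
        s = nxt
    coeffs.append(s[0])   # overall coefficient
    # For an orthonormal transform, keeping the largest-magnitude coefficients
    # minimizes the L2 reconstruction error, as noted in the text.
    return heapq.nlargest(B, enumerate(coeffs), key=lambda p: abs(p[1]))

print(haar_top_b([2, 2, 2, 2, 10, 10, 0, 0], B=3))
```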
While there has been a lot of work on summary structures, it remains an interesting open problem to address the issue of global space allocation between different synopses vying for the same space. It requires that we come up with a global error metric for the synopses, which we minimize given the (main memory) space constraint. Moreover, the allocation will have to be dynamic as the underlying data distribution and query workload change over time.

6.5 Sliding Windows

As discussed in Section 4, sliding windows prevent stale data from influencing analysis and statistics, and also serve as a tool for approximation in the face of bounded memory. There has been very little work on extending summarization techniques to sliding windows and it remains a ripe research area. We briefly describe some of the recent work.

Datar et al. [25] showed how to maintain simple statistics over sliding windows, including the sketches used for computing the L1 or L2 norm. Their technique requires a multiplicative space overhead of O((1/eps) log N), where N is the length of the sliding window and eps is the accuracy parameter. This enables the adaptation of the sketching-based algorithms to the sliding windows model. They also provide space lower bounds for various problems in the sliding windows model. In another paper, Babcock, Datar and Motwani [9] adapt the reservoir sampling algorithm to the sliding windows case. In their paper on computing iceberg queries over data streams, Manku and Motwani [60] also present techniques to adapt their algorithms to the sliding window model. Guha and Koudas [42] have adapted their earlier paper [43] to provide a technique for maintaining V-Optimal Histograms over sorted data streams for the sliding window model; however, they require the buffering of all the elements in the sliding window. The space requirement is linear in the size of the sliding window (N), although update time per data element is amortized to O((B^3 / eps^2) log^3 N), where B is the number of buckets in the histogram and eps is the accuracy parameter.
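As an illustration of the flavor of these techniques, here is a simplified sketch of the exponential-histogram idea of Datar et al. [25] for the special case of counting the 1s among the last W elements; the bucket-count threshold and other constants are simplifications of ours, not the tuned values from [25].

```python
class ExpHistogramCounter:
    """Simplified exponential histogram: approximately count the number of 1s
    among the last `window` stream elements, within relative error roughly eps."""
    def __init__(self, window, eps=0.1):
        self.window = window
        self.k = max(1, int(1.0 / eps))   # allowed buckets per size (simplified constant)
        self.buckets = []                 # [most_recent_time, size], oldest first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Drop buckets whose most recent 1 has slid out of the window.
        while self.buckets and self.buckets[0][0] <= self.time - self.window:
            self.buckets.pop(0)
        if bit != 1:
            return
        self.buckets.append([self.time, 1])
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idx) <= self.k + 1:
                break
            i, j = idx[0], idx[1]         # the two oldest buckets of this size
            # The merged bucket keeps the newer timestamp and doubles in size.
            self.buckets[j] = [self.buckets[j][0], 2 * size]
            del self.buckets[i]
            size *= 2                     # a cascade of merges may follow

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        # The oldest bucket may be partially expired, so count only half of it.
        return total - self.buckets[0][1] // 2
```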
Some open problems for sliding windows are: clustering, maintaining top wavelet coefficients, maintaining statistics like variance, and computing correlated aggregates [37].

6.6 Negative Results

There is an emerging set of negative results on the space-time requirements of algorithms that operate in the data stream model. Henzinger, Raghavan, and Rajagopalan [49] provided space lower bounds for concrete problems in the data stream model. These lower bounds are derived from results in communication complexity [56]. To understand the connection, observe that the memory used by any one-pass algorithm for a function f, after seeing a prefix of the data stream, is lower bounded by the one-way communication required by two parties trying to compute f, where the first party has access to the same prefix and the second party has access to the suffix of the stream that is yet to arrive. Henzinger et al. use this approach to provide lower bounds for problems such as frequent item counting, approximate median, and some graph problems.

Again based on communication complexity, Alon, Matias and Szegedy [5] provide almost tight lower bounds for computing the frequency moments. In particular, they proved a lower bound of Omega(N) for estimating F-infinity, the count of the most frequent item, where N is the domain size. At first glance this lower bound and a similar lower bound in Henzinger et al. [49] may seem to contradict the frequent item-set counting results of Manku and Motwani [60]. But note that the latter paper estimates the count of the most frequent item only if it exceeds eps*N. Such skewed distributions are common in practice, while the lower bounds are proven for pathological distributions where items have near-uniform frequency. This serves as a reminder that while it may be possible to prove strong space lower bounds for stream computations, considerations from applications sometimes enable us to circumvent the negative results.

Saks and Sun [71] provide space lower bounds for distance approximation between two vectors under the Lp norm, for p > 3, in the data stream model. Munro and Paterson [66] showed that any algorithm that computes quantiles exactly in p passes requires Omega(N^(1/p)) space. Space lower bounds for maintaining simple statistics like count, sum, min/max, and number of distinct values under the sliding windows model can be found in the work of Datar et al. [25].

A general lower bound technique for sampling-based algorithms was presented by Bar-Yossef et al. [11]. It is useful for deriving space lower bounds for data stream algorithms that resort to oblivious sampling. It remains an interesting open problem to obtain similar general lower bound techniques for other classes of algorithms for the data stream model. We feel that techniques based on communication complexity results [56] will prove useful in this context.

6.7 Miscellaneous

In this section, we give a potpourri of algorithmic results for data streams.

Data Mining

Decision trees are another form of synopsis used for prediction. Domingos et al. [27, 28] have studied the problem of maintaining decision trees over data streams. Clustering is yet another way to summarize data. Consider the k-median formulation for clustering: Given n data points in a metric space, the objective is to choose k representative points, such that the sum of the errors over the n data points is minimized. The "error" for each data point is the distance from that point to the nearest of the k chosen representative points. Guha et al. [44] presented a single-pass algorithm for maintaining approximate k-medians (cluster centers) that uses O(N^eps) space, for some eps < 1, and O(poly(log N)) amortized time per data element, to compute a constant factor approximation to the k-median problem. Their algorithm uses a divide-and-conquer approach which works as follows: Clustering proceeds hierarchically, where a small number (N^eps) of the original data points are clustered into k centers. These k centers are weighted by the number of points that are closest to them in the local solution. When we get N^eps weighted cluster centers by clustering different sets, we cluster them into higher-level cluster centers, and so on.
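The divide-and-conquer scheme can be sketched as follows (our illustration; the chunk size and the inner weighted clustering routine — here a simple farthest-point heuristic over one-dimensional points — are stand-ins for the components analyzed in [44]).

```python
import random

def greedy_centers(points, weights, k, rng=random.Random(0)):
    """Very simple weighted clustering stand-in: pick k centers by farthest-point
    traversal and assign each point's weight to its nearest center."""
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < min(k, len(points)):
        far = max(points, key=lambda p: min(abs(p - c) for c in centers))
        centers.append(far)
    new_weights = [0] * len(centers)
    for p, w in zip(points, weights):
        j = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
        new_weights[j] += w
    return centers, new_weights

def stream_k_median(stream, k, chunk=1000):
    """Hierarchical one-pass clustering in the spirit of Guha et al. [44]: cluster each
    chunk into k weighted centers, and recluster the centers when too many accumulate."""
    centers, weights, buf = [], [], []
    for x in stream:
        buf.append(x)
        if len(buf) == chunk:
            c, w = greedy_centers(buf, [1] * len(buf), k)
            centers.extend(c)
            weights.extend(w)
            buf = []
            if len(centers) > chunk:          # second level: recluster the weighted centers
                centers, weights = greedy_centers(centers, weights, k)
    if buf:
        c, w = greedy_centers(buf, [1] * len(buf), k)
        centers.extend(c)
        weights.extend(w)
    return greedy_centers(centers, weights, k)   # final k weighted medians

centers, weights = stream_k_median(iter(range(10000)), k=5)
```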
Multiple Streams

Gibbons and Tirthapura [38] considered the problem of computing simple functions, such as the number of distinct elements, over unions of data streams. This is useful for applications that work in a distributed environment, where it is not feasible to send all the data to a central site for processing. It is important to note that some of the techniques presented earlier, especially those that are based on sketching, are amenable to distributed computation over multiple streams.

Reductions of Streams

In a recent paper, Bar-Yossef, Kumar, and Sivakumar [12] introduce the notion of reductions in streaming algorithms. In order for the reductions to be efficient, one needs to employ list-efficient streaming algorithms. The idea behind list-efficient streaming algorithms is that instead of being presented one data item at a time, they are implicitly presented with a list of data items in a succinct form. If the algorithm can efficiently process the list in time that is a function of the succinct representation size, then it is said to be list-efficient. They develop some list-efficient algorithms and, using the reduction paradigm, address several interesting problems like computing frequency moments [5] (which includes the special case of counting the number of distinct elements) and counting the number of triangles in a graph presented as a stream. They also prove a space lower bound for the latter problem. To the best of our knowledge, besides this work and that of Henzinger et al. [49], there has been little work on graph problems in the streaming model. Such algorithms will likely be very useful for analyzing large graphical structures such as the web graph.

Property Testing

Feigenbaum et al. [34] introduced the concept of streaming property testers and streaming spot checkers. These are programs that make one pass over the data and, using small space, verify whether the data satisfies a certain property. They show that there are properties that are efficiently testable by a streaming-tester but not by a sampling-tester, and other problems for which the converse is true. They also present an efficient sampling-tester for testing the "groupedness" property of a sequence that uses O(sqrt(N)) samples, O(sqrt(N)) space and O(sqrt(N) log N) time. A sequence a1, ..., aN is said to be grouped if ai = aj and i < k < j imply ai = ak = aj, i.e., for each type T, all occurrences of T are in a single contiguous run. Thus, groupedness is a natural relaxation of the sortedness property and is a natural property that one may desire in a massive streaming data set. The work discussed here illustrates that some properties are efficiently testable by sampling algorithms but not streaming algorithms.

Measuring Sortedness

Measuring the "sortedness" of a data stream could be useful in some applications; for example, it is useful in determining the choice of a sort algorithm for the underlying data. Ajtai et al. [3] have studied the problem of estimating the number of inversions (a measure of sortedness) in a permutation to within a factor eps, where the permutation is presented in a data stream model. They obtained an algorithm using O(log N log log N) space and O(log N) time per data element. They also prove an Omega(N) space lower bound for randomized exact computation, thus showing that approximation is essential.

7. CONCLUSION AND FUTURE WORK

We have isolated a number of issues that arise when considering data management, query processing, and algorithmic problems in the new setting of continuous data streams. We proposed some initial solutions, described past and current work related to data streams, and suggested a general architecture for a Data Stream Management System (DSMS). At this point let us take a step back and consider some "meta-questions" with regard to the motivations and need for a new approach.

- Is there more to effective data stream systems than conventional database technology with enhanced support for streaming primitives such as triggers, temporal constructs, and data rate management?
- Is there a need for database researchers to develop fundamental and general-purpose models, algorithms, and systems for data streams? Perhaps it suffices to build ad hoc solutions for each specific application (network management, web monitoring, security, finance, sensors, etc.).
- Are there any "killer apps" for data stream systems?

We believe that all three questions can be answered in the affirmative, although of course only time will tell.

Assuming positive answers to the "meta-questions" above, we see several fundamental aspects to the design of data stream systems, some of which we discussed in detail in this paper. One important general question is the interface provided by the DSMS. Our approach at Stanford is to extend SQL to support stream-oriented primitives, providing a purely declarative interface as in traditional database systems, although we also allow direct submission of query plans. In contrast, the Aurora project [15] provides a procedural "boxes and arrows" approach as the primary interface for the application developer.

Other fundamental issues discussed in the paper include timestamping and ordering, support for sliding window queries, and dealing effectively with blocking operators. A major open question, about which we had very little to say, is that of dealing with distributed streams. It does not make sense to redirect high-rate streams to a central location for query processing, so it becomes imperative to push some processing to the points of arrival of the distributed streams, raising a host of issues at every level of a DSMS [58]. Another issue we touched on only briefly in Section 4.5 is that of constraints over streams, and how they can be exploited in query processing. Finally, many systems questions remain open in query optimization, construction of synopses, resource management, approximate query processing, and the development of an appropriate and well-accepted benchmark for data stream systems.

From a purely theoretical perspective, perhaps the most interesting open question is that of defining extensions of relational operators to handle data stream constructs, and to study the resulting "stream algebra" and other properties of these extensions. Such a foundation is surely key to developing a general-purpose, well-understood query processor for data streams.

Acknowledgements

We thank all the members of the Stanford STREAM research group for their contributions and feedback.

8. REFERENCES

[1] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 487–498, May 2000.
[2] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 275–286, June 1999.
[3] M. Ajtai, T. Jayram, R. Kumar, and D. Sivakumar. Counting inversions in a data stream. Manuscript, 2001.
[4] N. Alon, P. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proc. of the 1999 ACM Symp. on Principles of Database Systems, pages 10–20, 1999.
[5] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. of the 1996 Annual ACM Symp. on Theory of Computing, pages 20–29, 1996.
[6] M. Altinel and M. J. Franklin. Efficient filtering of XML documents for selective dissemination of information. In Proc. of the 2000 Intl. Conf. on Very Large Data Bases, pages 53–64, Sept. 2000.
[7] A. Arasu, B. Babcock, S. Babu, J. McAlister, and J. Widom. Characterizing memory requirements for queries over continuous data streams. In Proc. of the 2002 ACM Symp. on Principles of Database Systems, June 2002. Available at http://dbpubs.stanford.edu/pub/2001-49.
[8] R. Avnur and J. Hellerstein. Eddies: Continuously adaptive query processing. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 261–272, May 2000.
[9] B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In Proc. of the 2002 Annual ACM-SIAM Symp. on Discrete Algorithms, pages 633–634, 2002.
[10] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, 30(3):109–120, Sept. 2001.
[11] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Sampling algorithms: Lower bounds and applications. In Proc. of the 2001 Annual ACM Symp. on Theory of Computing, pages 266–275, 2001.
[12] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. of the 2002 Annual ACM-SIAM Symp. on Discrete Algorithms, pages 623–632, 2002.
[13] S. Bellamkonda, T. Borzkaya, B. Ghosh, A. Gupta, J. Haydu, S. Subramanian, and A. Witkowski. Analytic functions in oracle 8i. Available at http://www-db.stanford.edu/dbseminar/Archive/SpringY2000/speakers/agupta/paper.pdf.
[14] J. A. Blakeley, N. Coburn, and P. A. Larson. Updating derived relations: Detecting irrelevant and autonomously computable updates. ACM Trans. on Database Systems, 14(3):369–400, 1989.
[15] D. Carney, U. Cetinternel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams – a new class of dbms applications. Technical Report CS-02-01, Department of Computer Science, Brown University, Feb. 2002.
[16] K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. In Proc. of the 2000 Intl. Conf. on Very Large Data Bases, pages 111–122, Sept. 2000.
[17] M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In Proc. of the 2000 ACM Symp. on Principles of Database Systems, pages 268–279, 2000.
[18] S. Chaudhuri, G. Das, and V. Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 295–306, May 2001.
[19] S. Chaudhuri and R. Motwani. On sampling and relational operators. Bulletin of the Technical Committee on Data Engineering, 22:35–40, 1999.
[20] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, pages 436–447, 1998.
[21] S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 263–274, June 1999.
[22] S. Chaudhuri and V. Narasayya. An efficient cost-driven index selection tool for microsoft sql server. In Proc. of the 1997 Intl. Conf. on Very Large Data Bases, pages 146–155, 1997.
[23] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagraCQ: A scalable continuous query system for internet databases. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 379–390, May 2000.
[24] C. Cortes, K. Fisher, D. Pregibon, and A. Rogers. Hancock: a language for extracting signatures from data streams. In Proc. of the 2000 ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 9–17, Aug. 2000.
[25] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. of the 2002 Annual ACM-SIAM Symp. on Discrete Algorithms, pages 635–644, 2002.
[26] A. Dobra, J. Gehrke, M. Garofalakis, and R. Rastogi. Processing complex aggregate queries over data streams. In Proc. of the 2002 ACM SIGMOD Intl. Conf. on Management of Data, 2002.
[27] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 2000 ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 71–80, Aug. 2000.
[28] P. Domingos, G. Hulten, and L. Spencer. Mining time-changing data streams. In Proc. of the 2001 ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 97–106, 2001.
[29] N. Duffield and M. Grossglauser. Trajectory sampling for direct traffic observation. In Proc. of the 2000 ACM SIGCOMM, pages 271–284, Sept. 2000.
[30] D. B. et al. The New Jersey data reduction report. IEEE Data Engineering Bulletin, 20(4):3–45, 1997.
[31] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In Proc. of the 1994 ACM SIGMOD Intl. Conf. on Management of Data, pages 419–429, May 1994.
[32] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. In Proc. of the 1998 Intl. Conf. on Very Large Data Bases, pages 299–310, 1998.
[33] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate l1-difference algorithm for massive data streams. In Proc. of the 1999 Annual IEEE Symp. on Foundations of Computer Science, pages 501–511, 1999.
[34] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. Testing and spot checking of data streams. In Proc. of the 2000 Annual ACM-SIAM Symp. on Discrete Algorithms, pages 165–174, 2000.
[35] P. Flajolet and G. Martin. Probabilistic counting. In Proc. of the 1983 Annual IEEE Symp. on Foundations of Computer Science, 1983.
[36] H. Garcia-Molina, W. Labio, and J. Yang. Expiring data in a warehouse. In Proc. of the 1998 Intl. Conf. on Very Large Data Bases, pages 500–511, Aug. 1998.
[37] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 13–24, May 2001.
[38] P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. of the 2001 ACM Symp. on Parallel Algorithms and Architectures, pages 281–291, 2001.
[39] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In Proc. of the 2002 Annual ACM Symp. on Theory of Computing, 2002.
[40] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proc. of the 2001 Intl. Conf. on Very Large Data Bases, pages 79–88, 2001.
[41] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 58–66, 2001.
[42] S. Guha and N. Koudas. Approximating a data stream for querying and estimation: Algorithms and performance evaluation. In Proc. of the 2002 Intl. Conf. on Data Engineering, 2002.
[43] S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In Proc. of the 2001 Annual ACM Symp. on Theory of Computing, pages 471–475, 2001.
[44] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proc. of the 2000 Annual IEEE Symp. on Foundations of Computer Science, pages 359–366, Nov. 2000.
[45] A. Gupta, H. V. Jagadish, and I. S. Mumick. Data integration using self-maintainable views. In Proc. of the 1996 Intl. Conf. on Extending Database Technology, pages 140–144, Mar. 1996.
[46] P. Haas, J. Naughton, P. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proc. of the 1995 Intl. Conf. on Very Large Data Bases, pages 311–322, Sept. 1995.
[47] J. Hellerstein, M. Franklin, et al. Adaptive query processing: Technology in evolution. IEEE Data Engineering Bulletin, 23(2):7–18, June 2000.
[48] J. Hellerstein, P. Haas, and H. Wang. Online aggregation. In Proc. of the 1997 ACM SIGMOD Intl. Conf. on Management of Data, pages 171–182, May 1997.
[49] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report TR 1998-011, Compaq Systems Research Center, Palo Alto, California, May 1998.
[50] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. of the 2000 Annual IEEE Symp. on Foundations of Computer Science, pages 189–197, 2000.
[51] Y. E. Ioannidis and V. Poosala. Histogram-based approximation of set-valued query-answers. In Proc. of the 1999 Intl. Conf. on Very Large Data Bases, pages 174–185, Sept. 1999.
[52] iPolicy Networks home page. http://www.ipolicynetworks.com.
[53] Z. Ives, D. Florescu, M. Friedman, A. Levy, and D. Weld. An adaptive query execution system for data integration. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 299–310, June 1999.
[54] H. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In Proc. of the 1998 Intl. Conf. on Very Large Data Bases, pages 275–286, 1998.
[55] H. Jagadish, I. Mumick, and A. Silberschatz. View maintenance issues for the Chronicle data model. In Proc. of the 1995 ACM Symp. on Principles of Database Systems, pages 113–124, May 1995.
[56] E. Kushlevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.
[57] L. Liu, C. Pu, and W. Tang. Continual queries for internet scale event-driven information delivery. IEEE Trans. on Knowledge and Data Engineering, 11(4):583–590, Aug. 1999.
[58] S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In Proc. of the 2002 Intl. Conf. on Data Engineering, Feb. 2002. (To appear).
[59] S. Madden, J. Hellerstein, M. Shah, and V. Raman. Continuously adaptive continuous queries over streams. In Proc. of the 2002 ACM SIGMOD Intl. Conf. on Management of Data, June 2002. (To appear).
[60] G. Manku and R. Motwani. Approximate frequency counts over streaming data. Manuscript, 2002.
[61] G. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, pages 426–435, June 1998.
[62] G. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 251–262, June 1999.
[63] Y. Matias, J. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, pages 448–459, June 1998.
[64] Y. Matias, J. Vitter, and M. Wang. Dynamic maintenance of wavelet-based histograms. In Proc. of the 2000 Intl. Conf. on Very Large Data Bases, pages 101–110, Sept. 2000.
[65] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[66] J. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12:315–323, 1980.
[67] B. Nguyen, S. Abiteboul, G. Cobena, and M. Preda. Monitoring XML data on the web. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 437–448, May 2001.
[68] V. Poosala and V. Ganti. Fast approximate answers to aggregate queries on a data cube. In Proc. of the 1999 Intl. Conf. on Scientific and Statistical Database Management, pages 24–33, July 1999.
[69] D. Quass, A. Gupta, I. Mumick, and J. Widom. Making views self-maintainable for data warehousing. In Proc. of the 1996 Intl. Conf. on Parallel and Distributed Information Systems, pages 158–169, Dec. 1996.
[70] V. Raman, B. Raman, and J. Hellerstein. Online dynamic reordering for interactive data processing. In Proc. of the 1999 Intl. Conf. on Very Large Data Bases, 1999.
[71] M. Saks and X. Sun. Space lower bounds for distance approximation in the data stream model. In Proc. of the 2002 Annual ACM Symp. on Theory of Computing, 2002.
[72] U. Schreier, H. Pirahesh, R. Agrawal, and C. Mohan. Alert: An architecture for transforming a passive DBMS into an active DBMS. In Proc. of the 1991 Intl. Conf. on Very Large Data Bases, pages 469–478, Sept. 1991.
[73] T. K. Sellis. Multiple-query optimization. ACM Trans. on Database Systems, 13(1):23–52, 1988.
[74] P. Seshadri, M. Livny, and R. Ramakrishnan. Sequence query processing. In Proc. of the 1994 ACM SIGMOD Intl. Conf. on Management of Data, pages 430–441, May 1994.
[75] P. Seshadri, M. Livny, and R. Ramakrishnan. Seq: A model for sequence databases. In Proc. of the 1995 Intl. Conf. on Data Engineering, pages 232–239, Mar. 1995.
[76] P. Seshadri, M. Livny, and R. Ramakrishnan. The design and implementation of a sequence database system. In Proc. of the 1996 Intl. Conf. on Very Large Data Bases, pages 99–110, Sept. 1996.
[77] J. Shanmugasundaram, K. Tufte, D. J. DeWitt, J. F. Naughton, and D. Maier. Architecting a network query engine for producing partial results. In Proc. of the 2000 Intl. Workshop on the Web and Databases, pages 17–22, May 2000.
[78] R. Snodgrass and I. Ahn. A taxonomy of time in databases. In Proc. of the 1985 ACM SIGMOD Intl. Conf. on Management of Data, pages 236–245, 1985.
[79] SQL Standard. On-line analytical processing (sql/olap). Available from http://www.ansi.org/, document #ISO/IEC 9075-2/Amd1:2001.
[80] Stanford Stream Data Management (STREAM) Project. http://www-db.stanford.edu/stream.
[81] M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In Proc. of the 1996 Intl. Conf. on Very Large Data Bases, page 594, Sept. 1996.
[82] D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In Proc. of the 1992 ACM SIGMOD Intl. Conf. on Management of Data, pages 321–330, June 1992.
[83] Traderbot home page. http://www.traderbot.com.
[84] P. Tucker, D. Maier, T. Sheard, and L. Fegaras. Enhancing relational operators for querying over punctuated data streams. Manuscript, 2002. Available at http://www.cse.ogi.edu/dot/niagara/pstream/punctuating.pdf.
[85] J. Ullman and J. Widom. A First Course in Database Systems. Prentice Hall, Upper Saddle River, New Jersey, 1997.
[86] T. Urhan and M. Franklin. Xjoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin, 23(2):27–33, June 2000.
[87] S. Viglas and J. Naughton. Rate-based query optimization for streaming information sources. In Proc. of the 2002 ACM SIGMOD Intl. Conf. on Management of Data, June 2002. (To appear).
[88] J. Vitter. Random sampling with a reservoir. ACM Trans. on Mathematical Software, 11(1):37–57, 1985.
[89] J. Vitter. External memory algorithms and data structures. In J. Abello, editor, External Memory Algorithms, pages 1–18. Dimacs, 1999.
[90] J. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 193–204, June 1999.
[91] J. Vitter, M. Wang, and B. Iyer. Data cube approximation and histograms via wavelets. In Proc. of the 1998 Intl. Conf. on Information and Knowledge Management, Nov. 1998.
[92] Xml path language (XPath) version 1.0, Nov. 1999. W3C Recommendation available at http://www.w3.org/TR/xpath.
[93] Yahoo home page. http://www.yahoo.com.