
SSPM’s College Of Engineering, Kankavli

Class: BE AIML Subject: BDA

Module 4
Mining Data Streams

Data-Stream-Management System

A DSMS consists of several layers, each dedicated to a particular operation:
1. Data Source Layer
The first layer of a DSMS is the data source layer. As its name suggests, it comprises all the data sources, including sensors, social media feeds, financial markets, stock tickers, etc. Capturing and parsing of the data streams happens in this layer; essentially, it is the collection layer that collects the data.
Any number of streams can enter the system. Each stream can provide elements at
its own schedule; they need not have the same data rates or data types, and the time
between elements of one stream need not be uniform.

2. Data Ingestion Layer


You can consider this layer a bridge between the data source layer and the processing layer. Its main purpose is to handle the flow of data, i.e., data flow control, data buffering, and data routing.

3. Processing Layer
This layer is considered the heart of the DSMS architecture; it is the functional layer of DSMS applications. It processes the data streams in real time using processing engines such as Apache Flink or Apache Storm. The main functions of this layer are to filter, transform, aggregate, and enrich the data streams, in order to derive insights and detect patterns.

4. Storage Layer
Data are typically stored in three partitions:
1. Temporary working storage (e.g., for window queries).
2. Summary storage.
3. Static storage for meta-data (e.g., physical location of each source).
Once the data is processed, we need to store it in a storage unit. The storage layer consists of various stores such as NoSQL databases, distributed databases, etc. It helps ensure data durability and availability in case of system failure.

5. Querying Layer
As mentioned above, a DSMS supports two types of queries: ad hoc queries and standing queries. This layer provides the tools used for querying and analysing the stored data streams, and it offers SQL-like query languages or programming APIs. These queries can be questions like: How many entries have been made? Which types of data were inserted?

6. Visualization and Reporting Layer


This layer provides tools for visualization, such as charts, pie charts, histograms, etc. On the basis of these visual representations, it also helps generate reports for analysis.

7. Integration Layer
This layer is responsible for integrating DSMS applications with traditional systems, business intelligence tools, data warehouses, ML applications, and NLP applications. It helps improve applications that are already running.
Together, these layers make DSMS applications work: they provide scalable, fault-tolerant applications that can handle huge volumes of streaming data. The set of layers can change according to business requirements; some deployments may include all the layers, while others may exclude some.
The working store might be disk, or it might be main memory, depending on how fast
we need to process queries. But either way, it is of sufficiently limited capacity that it cannot
store all the data from all the streams.

Stream Queries
1. Standing queries (continuous)
- Queries that are asked of the stream at all times (continuously).
- Example: Alert me whenever the electric current value exceeds 60 A.
2. Ad hoc queries / snapshot queries
- Queries asked once about the stream (a snapshot).
- Example: What is the average of the electric current values captured so far?

Abstract view of DSMS

Issues in Data Stream Query Processing


1. Unbounded Memory Requirements
Since data streams are potentially unbounded in size, the amount of storage required
to compute an exact answer to a data stream query may also grow without bound. Algorithms
that use external memory are not well-suited to data stream applications since they do not
support continuous queries and are typically too slow for real-time response. For this reason,
we are interested in algorithms that are able to confine themselves to main memory without
accessing disk.
2. Approximate Query Answering
When we are limited to a bounded amount of memory, it is not always possible to
produce exact answers for the data stream queries; however, high-quality approximate
answers are often acceptable in lieu of exact answers.
3. Sliding Windows
One technique for approximate query answering is to evaluate the query not over the
entire past history of the data streams, but rather only over sliding windows of the recent
data from the streams. For example, only data from the last week could be considered in
producing query answers, with data older than 1 week being discarded. Imposing sliding
windows on data streams is a natural method for approximation that has several attractive
properties. Most importantly, it emphasizes recent data, which in the majority of real-world
applications is more important and relevant than the old data: If one is trying in real-time to
make sense of network traffic patterns, or phone call or transaction records, or scientific
sensor data, then in general insights based on the recent past will be more informative and
useful than insights based on stale data. In fact, for many such applications, sliding windows are part of the desired query semantics, explicitly expressed as part of the user's query. A sliding window can be the most recent n elements of a stream,
for some n, or it can be all the elements that arrived within the last t time units, for example,
1 month. If we regard each stream element as a tuple, we can treat the window as a relation
and query it with any SQL query. Of course, the stream-management system must keep the
window fresh, deleting the oldest elements as new ones come in.


Example 1
Suppose a user wanted to compute the average call length, but considering only the 10 most
recent long-distance calls placed by each customer. The query can be formulated as follows:
SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.customer_id
              ROWS 10 PRECEDING
              WHERE S.type = 'Long Distance']

Example 2
A slightly more complicated example will be to return the average length of the last 1000
telephone calls placed by “Gold” customers:
SELECT AVG(V.minutes)
FROM (SELECT S.minutes
      FROM Calls S, Customers T
      WHERE S.customer_id = T.customer_id
      AND T.tier = 'Gold')
V [ROWS 1000 PRECEDING]
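
The window bookkeeping itself can be sketched in a few lines of Python (an illustration, not part of the original notes); the stream contents and window size n below are assumptions:

import collections

def windowed_average(stream, n):
    # Average over the n most recent elements; the deque automatically
    # evicts the oldest element as new ones arrive.
    window = collections.deque(maxlen=n)
    for minutes in stream:
        window.append(minutes)
        yield sum(window) / len(window)

# Usage: average of the 3 most recent call lengths at each step.
for avg in windowed_average([5, 12, 7, 30, 2], n=3):
    print(avg)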

4. Batch Processing, Sampling and Synopses


Another class of techniques for producing approximate answers is to avoid looking at
every data element and restrict query evaluation to some sort of sampling or batch processing
technique. In batch processing, rather than producing a continually up-to-date answer, the
data elements are buffered as they arrive, and the answer to the query is computed
periodically.
Sampling is based on the principle that trying to make use of all the data when computing an answer is futile, because data arrives faster than it can be processed. Instead,
some data points must be skipped altogether, so that the query is evaluated over a sample of
the data stream rather than over the entire data stream.
For some classes of data stream queries where no exact data structure with the desired properties exists, one can often design an approximate data structure that maintains
a small synopsis or sketch of the data rather than an exact representation, and therefore, is
able to keep computation per data element to a minimum.
5. Blocking Operators
A blocking query operator is a query operator that is unable to produce an answer
until it has seen its entire input. Sorting is an example of a blocking operator, as are
aggregation operators such as SUM, COUNT, MIN, MAX and AVG. If one thinks about
evaluating the continuous stream queries using a traditional tree of query operators, where
data streams enter at the leaves and final query answers are produced at the root, then the
incorporation of the blocking operators into the query tree poses problems.
Difference between DBMS and DSMS:

S.No.  Database Management System (DBMS)         Data Stream Management System (DSMS)
01.    Deals with persistent data.               Deals with stream data.
02.    Random data access.                       Sequential data access.
03.    Query-driven processing model,            Data-driven processing model,
       i.e. a pull-based model.                  i.e. a push-based model.
04.    Query plan is optimized at the            Based on adaptive query plans.
       beginning and fixed.
05.    Data update rates are relatively low.     Data update rates are relatively high.
06.    Queries are one-time queries.             Queries are continuous.
07.    Queries give exact answers.               Queries give exact or approximate answers.
08.    Provides no real-time service.            Provides real-time service.
09.    Uses unbounded disk storage, i.e.         Uses bounded main memory, i.e.
       unlimited secondary storage.              limited main memory.

Sampling Data in a Stream


What is Sampling?
- The process of collecting a representative collection of elements from the entire stream.
- The sample is usually much smaller than the entire stream.
- Retains all the significant characteristics and behaviour of the stream.
- Used to estimate/predict many crucial aggregates on the stream.

Sampling Techniques
1. Fixed proportion sampling
2. Fixed size sampling
3. Reservoir sampling
4. Biased reservoir sampling
5. Concise sampling

1. Fixed proportion sampling


- Samples data at a fixed proportion/percentage.
- Used when you are aware of the length of the data.
- Ensures a representative sample, i.e. a sample that retains almost all the characteristics of the entire data stream.
- Useful for large volumes, e.g. when there are billions of records.
- Two problems: may lead to under- or over-representation.
  o Under-representation means the sample will not represent the data well.
  o Over-representation means the sample may over-represent parts of the data.
- Example:
A social media platform wants to analyse the sentiment of its users towards a topic. It receives millions of tweets per day and uses fixed proportion sampling to select a representative sample: it randomly selects 1% of the tweets received each hour, ensuring a representative sample for statistical analysis of user sentiment towards the topic.
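
A minimal Python sketch of the idea (illustrative, not from the notes), keeping each element independently with the chosen probability:

import random

def fixed_proportion_sample(stream, proportion=0.01):
    # Keep each element independently with the given probability.
    for element in stream:
        if random.random() < proportion:
            yield element

In practice, one may instead hash a key (e.g. the user ID) into buckets and keep whole buckets, so that the same users are sampled consistently hour after hour.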

2. Fixed size sampling


- Samples a fixed number of data points.
- Does not guarantee a representative sample.
- Useful for reducing data volume.
- Can be biased if the data is not randomly distributed.
- Less effective as the data size increases.
- Example
Suppose we have a data stream of customer orders for an online store, with 10,000 orders
coming in every hour. Using fixed size sampling, we randomly select 1,000 orders from
each hour’s data stream for analysis, thus reducing the total number of data points to
process from 10,000 to 1,000 per hour.
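
A minimal sketch under the assumption that each hour's orders are buffered before sampling; random.sample draws the fixed-size sample uniformly:

import random

def fixed_size_sample(hour_batch, n=1000):
    # Uniformly choose n orders from the hour's buffered batch.
    return random.sample(hour_batch, min(n, len(hour_batch)))

To draw a fixed number of elements without buffering the whole stream, use reservoir sampling, described next.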

3. Reservoir sampling
- Reservoir-based methods were originally proposed in the context of one-pass access of
data from magnetic storage devices such as tapes.
- As in the case of streams, the number of records N is not known in advance and the
sampling must be performed dynamically as the records from the tape are read.
- Let us assume that we wish to obtain an unbiased sample of size n from the data stream.
- In this algorithm, we maintain a reservoir of size n from the data stream.
- The first n points in the data streams are added to the reservoir for initialization.
- Subsequently, when the (t + 1)th point from the data stream is received, it is added to the
reservoir with probability n/(t + 1).
- In order to make room for the new point, one of the current points in the reservoir is chosen uniformly at random and removed.
- Example
Reservoir sampling can be used to obtain a sample of size n from a stream of people with brown hair, without knowing in advance how many such people the stream contains.
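
A minimal Python sketch of the algorithm just described (variable names are illustrative):

import random

def reservoir_sample(stream, n):
    # Maintains a uniform random sample of size n over the stream.
    reservoir = []
    for t, element in enumerate(stream):          # t = 0, 1, 2, ...
        if t < n:
            reservoir.append(element)             # first n points initialize it
        elif random.random() < n / (t + 1):       # keep with probability n/(t+1)
            reservoir[random.randrange(n)] = element  # evict uniformly at random
    return reservoir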

4. Biased reservoir sampling


- Used in streams to select a subset of the data in a way that is not uniformly random.
- Can lead to a biased sample that may not be representative of the full dataset.
- The selection of elements is based on a predetermined probability distribution that may
be weighted towards certain elements or group of elements.
- The probability distribution used for biased reservoir sampling may be based on various
factors, such as the frequency of occurrence of certain types of data or the importance of
certain data points.

- Used when there are constraints on the resources available for sampling, such as limited
memory or computational power.
- It is important to carefully consider the potential biases introduced by this sampling
technique and adjust the analysis accordingly.
- Example
Suppose we have a data stream of product ratings and we want to select a sample of ratings to estimate the average rating of a product. However, we know that some users tend to give higher ratings than others. Using biased reservoir sampling, we can assign a higher probability of selection to ratings from users who tend to give more accurate ratings. This way, our sample is more likely to represent the true average rating of the product.
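
One common way to realize a biased reservoir is weighted sampling with the key u^(1/w) of Efraimidis and Spirakis; this sketch assumes each element arrives with a positive weight (the notes do not prescribe a specific scheme):

import heapq
import random

def weighted_reservoir_sample(stream, n):
    # stream yields (element, weight) pairs with weight > 0;
    # higher-weight elements are more likely to stay in the sample.
    heap = []                                       # min-heap of (key, element)
    for element, weight in stream:
        key = random.random() ** (1.0 / weight)     # larger weight -> larger key
        if len(heap) < n:
            heapq.heappush(heap, (key, element))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, element))  # evict the smallest key
    return [element for _, element in heap]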

5. Concise sampling
- Goal is to maintain a small reservoir of a fixed size while still achieving representative
sampling of the data stream.
- Number of samples that can be stored in memory at a given time is limited, which can be
a challenge when dealing with large data streams.
- Size of the sample may need to be adjusted based on the amount of memory available to
store the data.
- Instead of selecting samples randomly, the sampling algorithm may prioritize choosing
samples with unique or representative values of a particular attribute in the data stream.
- Example
o A bank wants to analyse customer spending habits from a stream of transactions.
o It uses concise sampling, choosing distinct customer IDs as the attribute.
o The size of the reservoir is limited to 1000 customers.
o The sample size is adjusted based on available memory.
o This allows for efficient analysis while maintaining accuracy.
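
A simplified sketch of the idea (the bookkeeping in full concise sampling differs in details): duplicates are stored as (value, count) pairs, and when the sample outgrows its capacity the sampling rate 1/tau is lowered and existing entries are re-flipped:

import random

def concise_sample(stream, capacity):
    tau = 1.0               # elements enter the sample with probability 1/tau
    sample = {}             # value -> count of sampled occurrences
    for value in stream:
        if random.random() < 1.0 / tau:
            sample[value] = sample.get(value, 0) + 1
        while len(sample) > capacity:        # footprint exceeded: subsample
            new_tau = tau * 1.5
            for v in list(sample):
                # each sampled occurrence survives with probability tau/new_tau
                kept = sum(random.random() < tau / new_tau
                           for _ in range(sample[v]))
                if kept:
                    sample[v] = kept
                else:
                    del sample[v]
            tau = new_tau
    return sample, tau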

Filtering Streams: Bloom Filter with Analysis.


1. Overview
I’m sure many of us must have seen the warning message – the username or e-mail is already
taken. If we notice carefully, the warning message appears within a few seconds. All websites
perform this validation during the sign-up process.
Have we ever wondered how websites validate millions of records within seconds? One of
the solutions is using the Bloom filter.

2. What Is a Bloom Filter?

Bloom filter is a probabilistic data structure. It’s used to test whether an element is a
member of a set. Of course, one can achieve the same result using other data structures as
well. However, the Bloom filter does this in a space- and time-efficient way.

Let’s understand how the Bloom filter is implemented. Under the hood, the Bloom filter is just an array of bits, with all bits initially set to zero. The running example below uses a Bloom filter of size 19 (indices 0 to 18).

3. Bloom Filter Operations

Bloom filter supports only two operations, Insert and Lookup. Let’s understand them with
an example:

3.1. Insert

We can perform Insert operations in constant space and time complexity. Bloom filter
performs the below steps for Insert operation:

1. Hash the input value


2. Mod the result by the length of the array
3. Set corresponding bit to 1

Let’s illustrate this by inserting two strings – John Doe and Jane Doe into the Bloom filter. Let’s
assume that hash values for these two strings are 1355 and 8337, respectively. Now, let’s
perform the modulo operation to get an index within the bounds of the bit array: 1355%19 =
6 and 8337%19 = 15.
After inserting these values, the bits at indices 6 and 15 of the Bloom filter are set to 1.

3.2. Lookup

We can perform a Lookup operation in constant time complexity. Bloom filter performs the
below steps as a part of the Lookup operation:

1. Hash the input value


2. Mod the result by the length of the array
3. Check if the corresponding bit is 0 or 1

If the bit is 0, then that input definitely isn’t a member of the set. But if the bit is 1, then
that input might be a member of the set. So let’s check whether the string John is present in the
Bloom filter:

In this example, John hashes to index 10 (after the modulo step), and the bit at index 10 is 0, which indicates that the given string isn’t present in the Bloom filter.
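
A minimal Python sketch of the structure described above; Python's built-in hash stands in for the article's unspecified hash function, so the concrete indices will differ from the worked example:

class BloomFilter:
    def __init__(self, size=19, hash_fn=hash):
        self.size = size
        self.bits = [0] * size          # all bits start at zero
        self.hash_fn = hash_fn

    def insert(self, value):
        # Hash, mod by the array length, set the corresponding bit.
        self.bits[self.hash_fn(value) % self.size] = 1

    def lookup(self, value):
        # 0 -> definitely not in the set; 1 -> possibly in the set.
        return self.bits[self.hash_fn(value) % self.size] == 1

bf = BloomFilter()
bf.insert("John Doe")
bf.insert("Jane Doe")
print(bf.lookup("John Doe"))   # True
print(bf.lookup("John"))       # False, unless its index collides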

4. False Positive Analysis

Bloom filter is a space- and time-efficient data structure. However, the tradeoff for that efficiency is that it’s probabilistic in nature. This means that searching for a nonexistent element can give an incorrect answer. In this section, we’ll look at a false-positive scenario and a possible solution to reduce its frequency.
Let’s check if the string James Bond is present or not in the Bloom filter. Assume that the hash
value of the input string is 1355, which is the same as John Doe.

In this example, the Lookup operation returns true, even though we never inserted the string James Bond into the Bloom filter.
The false-positive scenario occurs due to hash collision. We can use multiple hash functions
to reduce the hash collision frequency. So instead of setting only one bit, multiple bits will
be set for a single input.
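
A sketch of that fix (the seeding scheme built on hashlib is an illustrative assumption): with k hash functions, an insert sets k bits, and a lookup reports "possibly present" only if all k bits are 1:

import hashlib

class MultiHashBloomFilter:
    def __init__(self, size, k):
        self.size, self.k = size, k
        self.bits = [0] * size

    def _indexes(self, value):
        # Derive k hash functions by seeding one hash with i = 0..k-1.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def insert(self, value):
        for idx in self._indexes(value):
            self.bits[idx] = 1

    def lookup(self, value):
        # A false positive now requires collisions on all k bits at once.
        return all(self.bits[idx] for idx in self._indexes(value))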

5. Applications of the Bloom Filter

Bloom filters are used by many popular applications due to their efficiency. Let’s discuss some
of their use cases:

 Spell Checker: In the early days, spell checkers were implemented using the Bloom
filter

 Databases: Many popular databases use Bloom filters to reduce the costly disk
lookups for non-existent rows or columns. This technique is used
by PostgreSQL, Apache Cassandra, Cloud Bigtable, etc
 Search Engines: BitFunnel is a search engine indexing algorithm. It uses the Bloom
filter for its search indexes
 Security: Bloom filters are used to detect weak passwords, malicious URLs, etc

6. Limitations of the Bloom Filter

Though the Bloom filter is a space- and time-efficient data structure, it has a few limitations:

 The naive implementation of the Bloom filter doesn’t support the delete operation.
 The false-positive rate can be reduced, but it can’t be reduced to zero.

Counting Distinct Elements in a Stream


In computer science, the count-distinct problem (also known in applied mathematics
as the cardinality estimation problem) is the problem of finding the number of distinct
elements in a data stream with repeated elements. This is a well-known problem with
numerous applications. The elements might represent IP addresses of packets passing
through a router, unique visitors to a web site, elements in a large database, motifs in
a DNA sequence, or elements of RFID/sensor networks.

Count-Distinct Problem
Let us first give a formal description of the problem:
Problem: Given a stream X = ⟨x1, x2, ..., xm⟩ ∈ [n]^m of values, let F0 be the number of distinct elements in X. Find F0 under the following constraints on data stream algorithms.
Constraints:
1. Elements in the stream are presented sequentially, and only a single pass is allowed.
2. Limited space to operate; avoid swaps from secondary memory. Expected space complexity O(log(min(n, m))) or smaller.
3. Estimation guarantees: with error ε < 1 and high probability.

For example, consider the stream {a, b, a, c, d, b, d}. Here F0 = 4 = |{a, b, c, d}|.

Flajolet and Martin presented the first small space distinct-values estimation algorithm
called the Flajolet–Martin (FM) algorithm.

Flajolet-Martin Algorithm
The Flajolet-Martin algorithm is a probabilistic algorithm mainly used to count the number of distinct elements in a stream or database. It was invented by Philippe Flajolet and G. Nigel Martin in 1983 and has since been used in various applications such as data mining and database management.
The Flajolet-Martin algorithm is a single-pass algorithm. If there are m distinct elements in a multiset of n elements, the algorithm runs in O(n) time and O(log m) space.
The basic idea behind the Flajolet-Martin algorithm is to use a hash function to map the elements of the dataset to binary strings, and to use the longest run of trailing zeros observed in these binary strings as an estimator for the number of unique elements.
The steps of the Flajolet-Martin algorithm are:
1. Choose a hash function that maps the elements of the dataset to fixed-length binary strings. The length of the binary strings can be chosen based on the accuracy desired.
2. Apply the hash function to each item in the dataset to get its binary string representation.
3. For each binary string, determine its tail length r(a): the number of trailing zeros, i.e. the position of the rightmost 1-bit.
4. Compute R, the maximum tail length over all the binary strings.
5. Estimate the number of distinct elements in the dataset as 2^R, i.e. 2 to the power of the maximum tail length computed in the previous step.
The accuracy of the Flajolet-Martin algorithm is determined by the length of the binary strings and the number of hash functions it uses. Generally, increasing the length of the binary strings or using more hash functions increases the algorithm's accuracy.
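
A minimal Python sketch of these steps, using the convention r(0) = 0 adopted in the worked examples below:

def trailing_zeros(x):
    # Tail length r(a): number of trailing zeros in the binary form of x.
    if x == 0:
        return 0            # convention used in the worked examples
    r = 0
    while x % 2 == 0:
        x //= 2
        r += 1
    return r

def flajolet_martin(stream, h):
    # Estimate the distinct count as 2**R, where R is the maximum
    # tail length of h(x) seen over the whole stream.
    R = max(trailing_zeros(h(x)) for x in stream)
    return 2 ** R

# Example 1 below: h(x) = (6x + 1) mod 5
stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
print(flajolet_martin(stream, lambda x: (6 * x + 1) % 5))   # -> 4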
Example 1:
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5
Assume binary strings of length |b| = 5.

x    6x+1   Rem (mod 5)   binary   r(a)
1    7      2             00010    1
3    19     4             00100    2
2    13     3             00011    0
1    7      2             00010    1
2    13     3             00011    0
3    19     4             00100    2
4    25     0             00000    0
3    19     4             00100    2
1    7      2             00010    1
2    13     3             00011    0
3    19     4             00100    2
1    7      2             00010    1

R = max(r(a)) = 2
So the number of distinct elements is N = 2^R = 2^2 = 4.

Example 2
Suppose the stream is 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1, ...
Let h(x) = (3x + 1) mod 5.
• The transformed stream (h applied to each item) is 4, 0, 2, 4, 2, 0, 3, 0, 4, 2, 0, 4.
• Each element is converted to its binary equivalent: 100, 0, 10, 100, 10, 0, 11, 0, 100, 10, 0, 100.
• We compute r(a) for each item in the transformed stream: 2, 0, 1, 2, 1, 0, 0, 0, 2, 1, 0, 2.
• So R = maximum r(a) = 2. Output 2^2 = 4.

Space Requirements
Observe that as we read the stream, it is not necessary to store the elements seen.
The only thing we need to keep in main memory is one integer per hash function; this integer
records the largest tail length seen so far for that hash function and any stream element. If
we are processing only one stream, we could use millions of hash functions, which is far more
than we need to get a close estimate. Only if we are trying to process many streams at the
same time would main memory constrain the number of hash functions we could associate with any one stream. In practice, the time it takes to compute hash values for each stream
element would be the more significant limitation on the number of hash functions we use.

Counting Ones in a Window:


Here we turn our attention to counting problems for streams. Suppose we have a window of length N on a binary stream. We want at all times to be able to answer queries of the form "how many 1's are there in the last k bits?" for any k ≤ N.

The Datar-Gionis-Indyk-Motwani Algorithm


- Commonly known as the DGIM algorithm, after the initials of its authors.
- Designed to estimate the number of 1's in a window over a binary data stream.
- Uses O(log² N) bits to represent a window of N bits.
- The error in the estimate is no more than 50%, i.e. the algorithm's answer is within 50% of the true count.

Elements/Components

1. Timestamps
- Each element entering the stream is assigned a timestamp based on its position.
- Example: the first bit has timestamp 1, the second bit has timestamp 2, and so on.

2. Buckets
- Used to represent time intervals in the data stream.
- The algorithm divides the stream into buckets, each of whose size is a power of two.
- A bucket contains both 0's and 1's.

Rules for forming a bucket:

1. Every bucket should contain at least a single 1 in it.

10100011   valid bucket (contains at least one 1)
0000000    not a valid bucket (contains no 1's)

2. The right (rightmost) end of a bucket must be a 1, i.e. every bucket ends with a 1.

10100011   valid bucket (rightmost bit is 1)
1101000    not a valid bucket (rightmost bit is 0)

3. The size of a bucket is the number of 1's in it.

10100011   bucket of size 4 (it contains four 1's)

4. Every bucket size must be a power of two,
i.e. 2^0, 2^1, 2^2, 2^3, 2^4, 2^5, ...
i.e. 1, 2, 4, 8, 16, 32, ...

5. As we move to the left (towards older data), the bucket sizes must not decrease.

E.g. 101000110110100011

6. No more than two buckets can have the same size.

Example
Consider the following stream:
Stream = 10101111001000110101
N = 20 (the total number of bits)
Question: form the buckets for the given string. (Bucket diagrams omitted.)

• Now suppose a new bit 0 enters the stream. If the new bit entering the stream is 0, no change to the buckets is required.
• Now suppose the next bit to enter the stream is 1. If the new bit entering is 1, the arrangement of buckets must be altered: the new 1 becomes a bucket of size 1, and whenever three buckets of the same size arise, the two oldest of them are merged into one bucket of twice the size.
• If we add one more 1 bit, we again have to merge buckets and rearrange.

With the resulting bucket arrangement, the number of 1's in the given stream is calculated as

4 + 2 + 2 + 1 = 9
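
A Python sketch of the DGIM bookkeeping (a simplified illustration, not the notes' own code): buckets are kept as (rightmost timestamp, size) pairs, newest first, and merging keeps at most two buckets of each size:

class DGIM:
    def __init__(self, N):
        self.N = N            # window length
        self.t = 0            # current timestamp
        self.buckets = []     # (rightmost timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once it falls entirely out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            i = 0
            while i + 2 < len(self.buckets):
                # Three buckets of one size: merge the two oldest of them.
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    ts = self.buckets[i + 1][0]        # newer of the two
                    size = 2 * self.buckets[i + 1][1]
                    self.buckets[i + 1] = (ts, size)
                    del self.buckets[i + 2]
                i += 1

    def count_ones(self, k=None):
        # k=None: sum all bucket sizes, as in the worked example above.
        if k is None:
            return sum(size for _, size in self.buckets)
        # Standard DGIM estimate for the last k bits: count overlapping
        # buckets fully, except only half of the oldest one.
        total, oldest = 0, 0
        for ts, size in self.buckets:
            if ts > self.t - k:
                total += size
                oldest = size
        return total - oldest // 2

dgim = DGIM(N=20)
for b in "10101111001000110101":
    dgim.add(int(b))
print(dgim.count_ones())   # 11: the buckets partition all 1's still in the window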
