What is a data stream? Explain the data stream model in the context of big data with an example.
1. What is a Data Stream?
A data stream is a continuous, unbounded flow of data generated in real-time from various
sources. Unlike traditional data (which is stored in databases or files before being
processed), a data stream arrives continuously and often at very high speed.
📌 Example sources:
Sensors in IoT devices
Financial stock market feeds
Clickstreams from websites
Social media updates (tweets, likes, comments)
GPS signals from vehicles
---
2. Data Stream Model (in Big Data context)
In the big data world, data is not only large in volume but also fast in velocity. The data
stream model is used when data is too fast or too large to store entirely before processing.
Instead of storing everything, the system processes data on the fly as it arrives.
Key Characteristics of Data Stream Model
1. Continuous arrival: Data keeps coming without end.
2. Real-time or near real-time processing: Must be analyzed as it arrives.
3. One-pass algorithms: Often you can see each data item only once (you cannot reprocess it later); see the sketch after this list.
4. Memory limitation: Since streams are infinite, only small parts are stored temporarily.
5. Approximation: Instead of exact results, systems may use sampling, summarization, or
probabilistic algorithms.
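To make points 3 and 4 concrete, here is a minimal plain-Python sketch (with hypothetical numeric readings) of a one-pass computation: it maintains a running average in constant memory and never stores or revisits the stream.

```python
def running_average(stream):
    """One-pass running average: O(1) memory, each element seen exactly once."""
    count, total = 0, 0.0
    for value in stream:          # elements arrive one at a time and are not stored
        count += 1
        total += value
        yield total / count       # current estimate, updated on every arrival

# Usage with a hypothetical finite sample of sensor readings:
for avg in running_average([21.5, 22.0, 23.4, 22.8]):
    print(f"running average so far: {avg:.2f}")
```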
---
3. Example in Big Data Context
👉 Suppose we have a bank’s fraud detection system.
Traditional batch model: Collect all transaction logs, store them in a database, and analyze them once every night (too slow: the fraud may already have happened).
Data stream model: Each transaction is analyzed immediately as it occurs.
🚨 If a credit card is suddenly used in two different countries within minutes → suspicious activity alert.
📌 Here, the data stream model allows real-time fraud detection instead of delayed batch
processing.
---
4. Real-life Technologies for Data Streams
Apache Kafka → Messaging system for handling real-time streams.
Apache Flink / Apache Spark Streaming → Frameworks to process streams in real time.
Google Dataflow / AWS Kinesis → Cloud services for stream processing.
---
✅ In short:
A data stream is an ongoing flow of data that needs immediate or near-immediate
processing. The data stream model in big data helps handle massive, fast-moving, and
continuous data by processing it in real time rather than storing it first.
2. What is a Data Stream Management System (DSMS)?
A Data Stream Management System (DSMS) is a software system designed to process,
store (temporarily), and query continuous streams of data in real time.
Think of it as the streaming version of a Database Management System (DBMS):
A DBMS works with static data stored in tables (queries run on data at rest).
A DSMS works with dynamic, continuously arriving data streams (queries run on data in
motion).
---
2. Why is a DSMS Needed?
In the era of Big Data, many applications need to process real-time data rather than waiting for batch processing. A traditional DBMS cannot keep up because:
Data streams are unbounded (no end).
Queries must run continuously (not just once).
Only a small fraction of data can be stored due to memory/space limits.
So, DSMSs are built to handle this by using continuous queries and windowing techniques.
---
3. Key Features of DSMS
1. Continuous Queries (CQ):
Instead of running once, queries keep running on incoming streams.
Example: “Continuously monitor the average temperature from IoT sensors in the last 10
minutes.”
2. Windowing:
Since streams are infinite, DSMS uses windows to look at limited portions of the stream.
Types of windows:
Sliding window (last N seconds/records)
Tumbling window (fixed, non-overlapping chunks); a small sketch follows this list.
3. Low Latency Processing:
Provides results in real time or near real time.
4. Integration with Sources & Sinks:
Can ingest data from sensors, logs, social media, transactions etc.
Can output results to dashboards, alerts, databases.
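As a rough illustration of a continuous query over a tumbling window, the sketch below (plain Python, hypothetical (timestamp, temperature) events, 10-minute windows) emits the average temperature of each fixed window as it closes; a real system such as Flink would express this declaratively.

```python
def tumbling_window_average(stream, window_seconds=600):
    """Emit (window_start, average) for fixed, non-overlapping windows."""
    window_start, values = None, []
    for ts, temperature in stream:                 # ts = epoch seconds, temperature = reading
        if window_start is None:
            window_start = ts
        while ts >= window_start + window_seconds: # the current window has closed
            if values:
                yield window_start, sum(values) / len(values)
            window_start += window_seconds
            values = []
        values.append(temperature)
    if values:                                     # flush the last (partial) window
        yield window_start, sum(values) / len(values)

# Usage with hypothetical readings (timestamps in seconds):
readings = [(0, 20.0), (300, 22.0), (700, 25.0), (1300, 19.0)]
for start, avg in tumbling_window_average(readings):
    print(f"window starting at {start}s: avg = {avg:.1f} °C")
```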
---
4. Examples of DSMS
Academic systems: Aurora, STREAM, TelegraphCQ.
Industry systems:
Apache Kafka Streams
Apache Flink
Apache Spark Streaming
Google Dataflow
AWS Kinesis
---
5. Example Use Case
👉 Smart Traffic Monitoring System:
Data streams come from traffic cameras and sensors.
DSMS continuously queries the stream to detect:
Average vehicle speed in the last 5 minutes.
Number of vehicles crossing a signal per minute.
Sudden congestion alerts (if speed < 20 km/h for 2 consecutive minutes).
Without a DSMS, storing all raw data first would be too slow and too large.
---
✅ In short:
A Data Stream Management System (DSMS) is like a database system but for real-time,
continuous, and unbounded data streams. It supports continuous queries, windowing, and
real-time analytics on fast-moving data.
Examples of stream sources: where do continuous, real-time data streams come from?
---
1. Sensor & IoT Devices
Temperature sensors
Smart meters (electricity, water, gas)
Heart rate monitors / fitness bands
Vehicle GPS devices
Industrial IoT machines (predictive maintenance)
---
2. Financial Systems
Stock market tickers (real-time price feeds)
Online banking & credit card transactions
Cryptocurrency trading platforms
---
3. Web & User Interaction
Clickstream data (user clicks, searches, page visits on websites)
Online gaming events (player actions in real time)
E-commerce transactions (Amazon, Flipkart carts & purchases)
---
4. Social Media & Communication
Twitter/X tweets, likes, retweets (real-time firehose)
Facebook/Instagram posts, comments, live reactions
WhatsApp/Telegram message activity streams
---
5. System & Application Logs
Server logs (errors, requests, CPU usage)
Application logs (events happening continuously)
Security logs (login attempts, suspicious activities)
---
6. Multimedia Streams
Live video streaming (YouTube Live, Twitch)
Audio streaming (Spotify, radio services)
Surveillance cameras (CCTV feeds)
---
7. Transportation & Mobility
Flight radar data (real-time flight positions)
Ride-hailing apps (Uber, Ola, Lyft driver/passenger locations)
Public transport tracking (bus/train GPS updates)
---
✅ In short: Stream sources are any systems, devices, or applications that generate data
continuously in real time.
Standing queries vs. ad hoc queries (with examples)
---
1. Ad Hoc Queries
Definition:
A query that is written and executed once on stored data.
Nature:
Run on demand.
Works on static data (database tables).
Produces a one-time result.
Example (Bank Database):
Query:
SELECT AVG(amount)
FROM transactions
WHERE date BETWEEN '2025-09-01' AND '2025-09-15';
Meaning: “Get the average transaction amount in the first two weeks of September.”
✅ Runs once → produces a single result.
---
2. Standing Queries
Definition:
A query that runs continuously on incoming data streams and updates results in real-time.
Nature:
Registered once and stays active.
Works on dynamic, continuous streams.
Produces real-time or continuous results.
Example (Bank Data Stream):
Query (conceptual, in a DSMS like Apache Flink):
Continuously monitor transactions and alert if
amount > 50,000 in the last 5 minutes.
Meaning: Fraud detection system keeps watching all new transactions.
✅ Runs continuously → generates alerts whenever condition is met.
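The query above is conceptual; the sketch below shows one plain-Python reading of it, assuming hypothetical (timestamp, card_id, amount) events, a 5-minute sliding window per card, and a 50,000 threshold. In production this logic would live in a stream processor such as Flink or Kafka Streams.

```python
from collections import defaultdict, deque

def fraud_alerts(transactions, window_seconds=300, threshold=50_000):
    """Standing query: alert when one card's total in the last 5 minutes exceeds the threshold."""
    recent = defaultdict(deque)                  # card_id -> deque of (timestamp, amount)
    for ts, card_id, amount in transactions:     # in a real system this loop runs forever
        window = recent[card_id]
        window.append((ts, amount))
        while window and window[0][0] < ts - window_seconds:  # drop events outside the window
            window.popleft()
        total = sum(a for _, a in window)
        if total > threshold:
            yield ts, card_id, total             # continuous output: one alert per violation

# Usage with a hypothetical stream of (epoch_seconds, card_id, amount):
stream = [(0, "C1", 30_000), (60, "C1", 25_000), (400, "C2", 10_000)]
for ts, card, total in fraud_alerts(stream):
    print(f"🚨 t={ts}s card={card}: {total} spent in the last 5 minutes")
```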
---
3. Comparison Table
Feature | Ad Hoc Query | Standing Query
Data type | Stored, static data | Continuous, streaming data
Execution | One-time | Continuous (keeps running)
Result | Single result set | Ongoing/real-time results
Use case | Business reports, analytics | Fraud detection, traffic monitoring
System | DBMS | DSMS / Stream Processing Systems
---
✅ In short:
Ad hoc query → "Run now, give me a report once."
Standing query → "Keep watching the data and notify me whenever the condition happens."
Issues in stream processing
Stream processing is powerful, but it comes with challenges. The main issues are:
---
1. Unbounded & Continuous Data
Problem: Streams are infinite; you can’t store all the data.
Impact: Must process data in real time with limited memory.
Example: Social media firehose (Twitter/X) generates millions of events per second —
impossible to save all.
---
2. High Velocity & Volume
Problem: Data arrives at very high speed and in huge amounts.
Impact: Systems may struggle to ingest and process without delays.
Example: Stock market feeds — even a small lag of milliseconds can cause financial loss.
---
3. Limited Memory & Storage
Problem: Unlike batch systems, you can’t keep all past data in memory.
Impact: Need windowing, sampling, and summarization to handle streams.
---
4. Latency vs. Accuracy Trade-off
Problem: Real-time results may require approximations.
Impact: Exact answers may be delayed; faster results may be less accurate.
Example: Network monitoring may show an approximate number of packets dropped instead
of exact count.
---
5. Handling Out-of-Order & Late Data
Problem: Events don’t always arrive in the right sequence (network delays, clock
differences).
Impact: Must handle event time vs. processing time.
Example: GPS data from vehicles may arrive late or in wrong order due to poor network.
---
6. Fault Tolerance & Reliability
Problem: If the system crashes, stream data may be lost (since it’s not stored permanently).
Impact: Need mechanisms like checkpointing, replication, replay.
---
7. Scalability
Problem: Data sources can suddenly spike (e.g., viral trend on Twitter).
Impact: Must dynamically scale up/down processing systems.
---
8. Query Optimization
Problem: Standing queries run continuously, consuming CPU and memory.
Impact: Must optimize queries and avoid reprocessing overhead.
---
9. Integration with Heterogeneous Sources
Problem: Streams may come in different formats and protocols (JSON, XML, CSV, sensor
formats).
Impact: Need data cleaning & schema unification in real time.
---
10. Security & Privacy
Problem: Continuous data may include sensitive information (bank transactions, health
data).
Impact: Must ensure encryption, authentication, and compliance while processing streams.
---
✅ In short:
Issues in stream processing include infinite data, high speed, memory limits,
late/out-of-order events, fault tolerance, scalability, query optimization, integration
challenges, and security concerns.
What is sampling of data in a stream? How do we obtain a representative sample?
---
1. What is Sampling of Data in a Stream?
Definition:
Sampling is the process of selecting a small subset of elements from a continuous data
stream, so that the subset is representative of the entire stream.
Why we need it?
Streams are infinite and fast, so we cannot store or process all the data.
Sampling reduces data size while keeping important characteristics.
Used for approximation, monitoring, and analysis in real time.
---
2. Challenges in Sampling Stream Data
Data is unbounded (infinite).
We cannot know the total size (N) of the stream in advance.
We may only see each element once (no re-scanning).
Because of this, we need special algorithms to get representative samples.
---
3. How to Obtain Representative Samples
A. Random Sampling
Each element in the stream has an equal probability of being included in the sample.
Example: From a Twitter stream, include each tweet independently with probability 1/100 (on average, 1 out of every 100 tweets).
---
B. Reservoir Sampling (most common)
Works when you don’t know the total stream size in advance.
Algorithm:
1. Keep the first k elements in the reservoir (sample of size k).
2. For the i-th incoming element (i > k), select it with probability k/i.
3. If selected, replace a random element from the reservoir.
Ensures each element has equal probability of being included.
Example: Keep a random sample of 1000 transactions from a continuous bank stream.
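A minimal sketch of reservoir sampling as described in steps 1 to 3, assuming the stream is any Python iterable:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)                    # step 1: keep the first k elements
        elif random.random() < k / i:                 # step 2: keep the i-th element with probability k/i
            reservoir[random.randrange(k)] = item     # step 3: evict a random current member
    return reservoir

# Usage: sample 5 values from a stream of 1,000 simulated transaction amounts
sample = reservoir_sample((random.randint(100, 90_000) for _ in range(1_000)), k=5)
print(sample)
```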
---
C. Systematic Sampling
Select every n-th element from the stream.
Example: Pick every 50th sensor reading from a temperature sensor.
Simple, but may miss patterns if data is periodic.
---
D. Stratified Sampling
Divide the stream into strata (groups) and sample proportionally from each.
Example: In e-commerce transactions, ensure samples contain correct ratios of
“electronics,” “fashion,” and “grocery” purchases.
---
E. Priority/Weighted Sampling
Give more importance (weight) to certain data points.
Example: In network monitoring, give higher sampling probability to error packets compared
to normal packets.
---
4. Example in Big Data Context
👉 Suppose we want to monitor website activity:
1 million clicks per minute → cannot store all.
Using reservoir sampling with sample size = 10,000, we can still estimate:
Popular pages
User activity distribution
Peak usage trends
---
✅ In short:
Sampling in streams = selecting a smaller, representative subset of data from an infinite flow.
Representative samples are obtained using algorithms like reservoir sampling, random
sampling, systematic sampling, stratified sampling, and weighted sampling, depending on
application needs.
---
Explain the filtering process of data streams with a suitable example.
Perfect 👍 let’s break down filtering in data streams clearly with examples.
---
1. What is Filtering in Data Streams?
Definition:
Filtering is the process of removing unwanted or irrelevant data from a continuous data
stream so that only useful events are passed for further processing.
Why needed?
Streams contain huge volumes of mixed data.
Not all data is important → filtering reduces load, improves efficiency, and saves storage.
---
2. How Filtering Works
Filtering applies conditions (predicates) on incoming stream elements:
If an element satisfies the condition → it passes through.
If it fails the condition → it is discarded.
This is similar to using a WHERE clause in SQL, but applied continuously on live data.
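A minimal sketch of this idea in plain Python, using a hypothetical stream of (sensor_id, temperature) events and a temperature > 50 °C predicate; Flink or Spark Streaming express the same condition with a filter() operator.

```python
def overheating_events(readings, threshold=50.0):
    """Pass through only the readings that satisfy the predicate; discard the rest."""
    for sensor_id, temperature in readings:     # readings arrive continuously
        if temperature > threshold:             # the filter condition (predicate)
            yield sensor_id, temperature        # only matching events continue downstream

# Usage with a hypothetical batch of (sensor_id, temperature) events:
events = [("s1", 23.0), ("s2", 71.5), ("s3", 49.9), ("s4", 88.2)]
for sensor, temp in overheating_events(events):
    print(f"ALERT: sensor {sensor} reports {temp} °C")
```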
---
3. Example Scenarios
A. Stock Market Stream
Incoming data: Stream of all stock trades from NSE/BSE.
Filter condition: Only keep trades where price_change > 5%.
Result: System monitors only high-volatility stocks in real time.
---
B. Social Media Stream
Incoming data: Continuous tweets from Twitter API.
Filter condition: Only select tweets containing the keyword "Cybersecurity".
Result: Dashboard shows live cybersecurity discussions.
---
C. Sensor/IoT Data
Incoming data: Temperature readings from 10,000 IoT sensors.
Filter condition: Only forward readings where temperature > 50°C.
Result: Alert system triggers overheating alarms in factories.
---
4. Techniques Used for Filtering
1. Attribute-based filtering → Condition on field values
e.g., WHERE transaction_amount > 50,000.
2. Content-based filtering → Condition on message content
e.g., Tweets containing “#Election2025”.
3. Probabilistic/Approximate filtering → Use data structures like Bloom Filters
e.g., Filter duplicate user IDs in a stream.
---
5. Real-Life Technologies
Apache Kafka Streams → filter function
Apache Flink / Spark Streaming → filter() operator on streams
SQL-like DSMS → SELECT * FROM stream WHERE condition
---
✅ In short:
The filtering process in data streams continuously removes irrelevant data based on
conditions, letting only useful data pass for further analysis.
---
Explain the concept of a Bloom filter using an example.
Bloom filters are one of the best-known data structures in data stream processing. Here is a step-by-step explanation with an example.
---
1. What is a Bloom Filter?
A Bloom filter is a probabilistic data structure used to test whether an element is possibly in a
set or definitely not in the set.
✅ It saves memory and is very fast.
❌ It may give false positives (says element is present when it’s not),
but never false negatives (if it says not present, then 100% not present).
So:
"No" → guaranteed correct.
"Yes" → maybe correct (small chance of error).
---
2. Why Use Bloom Filters in Data Streams?
Data streams are huge and continuous → storing all past items is impossible.
Bloom filters allow us to quickly check membership (whether we’ve seen an element before)
with very little memory.
---
3. How Bloom Filter Works (Concept)
1. Start with a bit array of size m, all bits = 0.
2. Use k independent hash functions.
3. For each element added:
Apply all k hash functions → get k positions in the array.
Set those positions to 1.
4. To check membership:
Hash the element again with the same k functions.
If all corresponding bits = 1 → element may exist.
If any bit = 0 → element definitely does not exist.
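A minimal Bloom filter sketch in Python, following steps 1 to 4 above (small bit array, k salted SHA-256 hashes, and illustrative email addresses; real deployments use far larger arrays and tuned hash counts):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1000, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m                      # bit array, all zeros initially

    def _positions(self, item):
        # derive k positions from k salted SHA-256 hashes (illustrative choice)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1                   # set all k positions to 1

    def might_contain(self, item):
        # all bits 1 -> "maybe present"; any bit 0 -> "definitely not present"
        return all(self.bits[pos] for pos in self._positions(item))

# Usage: membership test on a stream of email addresses (illustrative values)
bf = BloomFilter()
bf.add("a@example.com")
bf.add("b@example.com")
print(bf.might_contain("a@example.com"))   # True: added earlier, never reported absent
print(bf.might_contain("c@example.com"))   # False in almost all runs (small false-positive chance)
```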
---
4. Example
👉 Suppose we have a stream of email addresses entering a spam filter (the addresses and hash values below are illustrative).
Step 1: Initialize
Bit array: [0,0,0,0,0,0,0,0,0,0] (size m = 10).
Hash functions: h1(x), h2(x) (k = 2 functions).
Step 2: Add "a@example.com"
h1("a@example.com") → 3, h2("a@example.com") → 7.
Set bits at positions 3 and 7 → [0,0,0,1,0,0,0,1,0,0].
Step 3: Add "b@example.com"
h1("b@example.com") → 2, h2("b@example.com") → 7.
Set bits at positions 2 and 7 → [0,0,1,1,0,0,0,1,0,0].
Step 4: Query "a@example.com"
Hash → 3 and 7 → both bits are 1 ✅ → "Maybe present".
Step 5: Query "c@example.com"
Hash → 5 and 8 → bits are 0 ❌ → "Definitely not present".
---
5. Applications of Bloom Filters
Web browsers → Check if a URL is in a blacklist.
Databases (e.g., Cassandra, HBase) → Quickly check if a key exists before querying disk.
Networking → Detect duplicate packets in routers.
Big Data Streams → Membership testing when storing all data is impossible.
---
✅ In short:
A Bloom filter is a memory-efficient, probabilistic way to test membership in a stream. It
never misses actual items, but may mistakenly say an unseen item exists (false positive).
How do we count distinct elements in a stream? Explain the Flajolet–Martin algorithm.
---
1. The Problem: Counting Distinct Elements in a Stream
In data streams, data arrives continuously and can be huge (millions/billions of items).
Storing all elements to count unique values is impossible (memory + time issues).
We need a probabilistic algorithm that uses very little memory but gives a good
approximation.
👉 Flajolet–Martin (FM) Algorithm solves this.
---
2. Flajolet–Martin Algorithm (Concept)
The idea is to estimate the number of distinct elements (n) in a stream using hashing + bit
patterns.
Steps:
1. Hash each element in the stream into a large bit string (using a hash function that
distributes values uniformly).
2. For each hashed value, find the position of the rightmost 1-bit (least significant 1).
Example: If hash = 101100 → rightmost 1 at position 3 (counting from the right, starting at 1).
3. Keep track of the maximum position R seen so far.
4. Estimate distinct count as:
n ≈ 2^R
---
3. Example
👉 Suppose our stream is:
{a, b, c, a, b, d}
Step 1: Hash elements
Let’s assume a simple hash function that produces 6-bit values:
h(a) = 010100
h(b) = 011000
h(c) = 001000
h(d) = 100100
Step 2: Find rightmost 1-bit
h(a) = 010100 → rightmost 1 at position 3
h(b) = 011000 → rightmost 1 at position 4
h(c) = 001000 → rightmost 1 at position 4
h(d) = 100100 → rightmost 1 at position 3
Step 3: Track maximum
Max position R = 4
Step 4: Estimate distinct count
n ≈ 2^R = 2^4 = 16
Actual distinct elements = 4 (a, b, c, d)
Estimate overshoots (that’s expected, it’s approximate).
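A minimal sketch of the algorithm, using the convention above for R (position of the rightmost 1-bit, counted from 1) and a single MD5-based hash chosen purely for illustration; production systems use many hash functions or HyperLogLog, as noted below.

```python
import hashlib

def rightmost_one_position(x):
    """Position of the rightmost 1-bit, counting from 1 (0 if x == 0)."""
    return (x & -x).bit_length()

def fm_estimate(stream):
    """Flajolet-Martin: estimate the number of distinct elements as 2^R."""
    R = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)  # uniform-ish hash value
        R = max(R, rightmost_one_position(h))
    return 2 ** R

# Usage: duplicates do not change the estimate, only distinct values matter
print(fm_estimate(["a", "b", "c", "a", "b", "d"]))  # prints a power of two: a rough estimate of the 4 distinct values
```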
---
4. Improving Accuracy
One FM estimate can be noisy.
In practice:
Run multiple hash functions (get many R values).
Use averaging/median to improve accuracy.
This extension is called LogLog or HyperLogLog (used in Google BigQuery, Redis, etc.).
---
5. Applications
Counting unique visitors on a website (can’t store all user IDs).
Counting unique IP addresses in network traffic.
Big data analytics in Hadoop/Spark for distinct counts.
Databases (PostgreSQL, Redis) use HyperLogLog for distinct counts.
---
✅ In short:
The Flajolet–Martin algorithm estimates the number of distinct elements in a data stream
using hashing and the position of the rightmost 1-bit.
It’s memory efficient,
approximate but fast,
scalable for big data streams.