ITECH WORLD AKTU
Subject Name: Data Analytics
Subject Code: BCS052
UNIT 3: Mining Data Streams
Syllabus
1. Introduction to streams concepts
2. Stream data model and architecture
3. Stream computing
4. Sampling data in a stream
5. Filtering streams
6. Counting distinct elements in a stream
7. Estimating moments
8. Counting ones in a window
9. Decaying window
10. Real-time Analytics Platform (RTAP) applications
11. Case studies:
• Real-time sentiment analysis
• Stock market predictions
1. Introduction to Streams Concepts
Definition: A data stream is a continuous and real-time flow of data elements made
available sequentially over time. Unlike traditional static datasets, data streams are
dynamic and require real-time processing and analysis to extract actionable insights.
Key Characteristics:
• Continuous Flow: Data streams are generated and processed continuously,
often without a defined start or end point.
• High Volume: Streams can produce a large amount of data per second, requiring scalable systems to handle the load.
• Real-Time Processing: Due to their continuous nature, streams demand
real-time or near real-time analysis.
• Transient Data: Data in streams may not be stored permanently and could
be processed in memory or with sliding windows.
• Heterogeneity: Data streams can come from diverse sources and in varying
formats (structured, semi-structured, or unstructured).
Applications of Data Streams:
• Social Media Analytics: Monitoring platforms like Twitter or Instagram for
trending topics and sentiment analysis.
• IoT (Internet of Things): Devices like smart sensors in a factory transmitting data about temperature, pressure, or performance in real-time.
• E-commerce: Streaming customer interactions on websites to offer dynamic
recommendations.
• Finance: Real-time stock price analysis for investment decisions.
• Transportation: Monitoring traffic flows using data from GPS devices.
Example 1: Social Media Feeds
• Social media platforms like Twitter continuously generate streams of tweets.
These streams can be processed in real-time to identify trending hashtags,
analyze user sentiments, or detect breaking news.
Example 2: Sensor Data from IoT Devices
• Consider a smart home system where sensors monitor temperature, humidity,
and energy usage. These sensors send continuous data streams to a central
server, enabling real-time decisions like adjusting the thermostat or sending
alerts for anomalies.
Challenges in Handling Data Streams:
• Scalability: Systems need to scale dynamically to handle spikes in data volume.
• Latency: Minimizing the time between data arrival and actionable insight is
crucial for applications like fraud detection.
• Data Quality: Ensuring the accuracy and reliability of streaming data, which
might have noise or incomplete values.
2. Stream Data Model and Architecture
Architecture of Stream Data Processing:
1. Data Streams:
• Continuous flow of data from various sources, such as sensors, social media,
or log files.
2. Stream Manager:
• Manages incoming data streams and forwards them to the processing components.
3. System Catalog:
• Stores metadata about the system, such as stream schemas and resources
used for processing.
4. Scheduler:
• Allocates tasks and resources efficiently to process the incoming data
streams.
5. Router:
• Directs data streams to appropriate processing units or queues for further
operations.
6. Queue Manager:
• Manages buffers and organizes incoming data to ensure orderly processing.
7. Query Processor:
• Executes queries on the stream data and provides results in real-time.
8. Query Optimizer:
• Optimizes query execution plans for efficiency in processing large volumes
of streaming data.
9. QoS Monitoring (Quality of Service):
• Ensures data processing meets predefined performance metrics, such as
latency and throughput.
10. Storage Manager:
• Handles temporary or permanent storage of processed data, typically in
databases or file systems.
11. Secondary Storage:
• Used for long-term storage of data streams and historical data.
3. Stream Computing
Definition: Stream computing refers to the continuous and real-time processing of
data streams as they arrive, without storing them for batch processing.
Key Features of Stream Computing:
• Continuous Processing: Data is processed on the fly as it streams in.
• Low Latency: Enables near real-time decision-making by minimizing processing delays.
• Scalable: Handles large and variable volumes of data streams efficiently.
• Fault Tolerance: Ensures the system can recover from failures and continue
processing seamlessly.
Advantages of Stream Computing:
• Real-Time Insights: Enables immediate response to events, such as fraud
detection or anomaly identification.
• Dynamic Scalability: Adapts to fluctuating data volumes in applications
like IoT and e-commerce.
• Improved Resource Utilization: Processes only the required data without
storing entire datasets.
• Continuous Analytics: Provides ongoing analysis rather than waiting for
batch processes to complete.
• Supports Complex Workflows: Handles advanced operations like filtering,
aggregations, and joins in real-time.
Applications of Stream Computing:
• Fraud Detection: Monitors transactions in real-time for unusual patterns.
• IoT Applications: Processes sensor data from smart devices continuously.
• Healthcare: Analyzes patient vitals in real-time for critical care alerts.
• E-commerce: Tracks customer behavior to offer real-time recommendations.
• Social Media Analytics: Monitors and analyzes trends and user sentiments.
Example: Real-Time Traffic Monitoring
• GPS devices in vehicles stream location data to a central system.
• Stream computing processes this data to identify traffic congestion and suggest
alternative routes in real-time.
Example: Fraud Detection in Banking
• Transaction streams are analyzed to detect unusual patterns in real-time.
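The fraud-detection idea above can be sketched with Python generators, which mirror how stream computing touches each event once without batch storage. The (account, amount) event format and the threshold are hypothetical choices for illustration.

```python
# Minimal sketch of stream computing with generators: each event is
# processed as it arrives; nothing is stored for batch processing.
def transaction_stream():
    """Simulated unbounded stream of (account, amount) events."""
    events = [("A", 20.0), ("A", 25.0), ("B", 9000.0), ("A", 18.0)]
    yield from events

def detect_fraud(stream, threshold=5000.0):
    """Flag transactions whose amount exceeds the threshold."""
    for account, amount in stream:
        if amount > threshold:
            yield account, amount

alerts = list(detect_fraud(transaction_stream()))
print(alerts)  # [('B', 9000.0)]
```

In a production setting the generator would be replaced by a consumer reading from a platform such as Apache Kafka, but the one-pass processing pattern is the same.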
4. Sampling Data in a Stream
Definition: Sampling in data streams involves selecting a representative subset of
the stream for analysis, especially when processing the entire stream is infeasible due
to high volume or resource constraints.
Importance of Sampling:
• Reduces computational load by working on a smaller subset of data.
• Helps in estimating patterns or trends without analyzing the full stream.
• Useful in scenarios where storage or processing capacity is limited.
Techniques for Sampling:
• Random Sampling:
– Data points are chosen randomly from the stream.
– Simple to implement but may not ensure uniform representation.
• Reservoir Sampling:
– Maintains a fixed-size sample from the stream by dynamically replacing
elements as new data arrives.
– Ensures uniform probability for each data point in the stream.
• Systematic Sampling:
– Picks every k-th element from the stream after selecting a random starting point.
– Useful when the stream has a repetitive structure or pattern.
• Stratified Sampling:
– Divides the stream into strata (subgroups) and samples proportionally
from each stratum.
– Ensures representation of all subgroups in the data.
• Priority Sampling:
– Assigns a priority to each data point based on importance or weight and
selects data with higher priorities.
Advantages of Sampling:
• Reduces memory and processing requirements.
• Allows real-time analytics on high-velocity streams.
• Enables quicker insights by focusing on relevant portions of the data.
Applications of Sampling:
• Social Media Analysis: Sampling a subset of tweets to estimate trending
topics or sentiments.
• Network Traffic Monitoring: Analyzing a fraction of packets to detect
anomalies or patterns.
• Financial Data Streams: Sampling stock price streams to estimate market
trends without analyzing all data.
Example 1: Random Sampling in Sentiment Analysis
• From a stream of 1 million tweets, randomly select 10% to estimate the sentiment trends for a product launch.
Example 2: Reservoir Sampling in IoT
• A smart thermostat collects temperature readings continuously. Reservoir sampling is used to maintain a fixed-size sample for quick diagnostics.
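Reservoir sampling, described above, fits in a few lines of Python. This sketch simulates the stream with a simple range; in practice the stream length is unknown in advance, which is exactly the case the algorithm handles.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # new item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item     # evict a uniformly chosen old item
    return reservoir

readings = range(10_000)                # simulated sensor readings
sample = reservoir_sample(readings, 100)
print(len(sample))                      # always exactly 100
```

Every element of the stream ends up in the final sample with equal probability k/n, which is what makes the sample uniform.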
5. Filtering Streams
Definition: Filtering streams involves selecting or discarding elements from a data
stream based on specific criteria or conditions.
Techniques:
• Content-based Filtering: Filters data based on attributes or content (e.g.,
filter tweets with specific hashtags).
• Time-based Filtering: Filters data within a specific time range (e.g., log
entries in the last 24 hours).
• Probabilistic Filtering: Uses approximate methods like Bloom Filters to
test membership efficiently.
Example:
• Filter sensor data to only include temperature readings above 30°C.
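The Bloom filter mentioned under probabilistic filtering can be sketched as a bit array indexed by several hash functions. The sizes (m = 1024 bits, k = 3 hashes) below are illustrative defaults, not tuned values.

```python
import hashlib

class BloomFilter:
    """Approximate membership test: no false negatives, rare false positives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k           # m bits, k hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("#Election2024")
print("#Election2024" in bf)            # True: an added item is never missed
```

A negative answer is always correct; a positive answer may occasionally be wrong, which is an acceptable trade-off when filtering high-volume streams with limited memory.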
6. Counting Distinct Elements in a Stream
Definition: Counting distinct elements identifies the number of unique items in a
data stream.
Challenges:
• High data volume makes storing all elements impractical.
• Efficient algorithms are needed to approximate the count.
Techniques:
• Hashing: Hash elements to reduce memory usage.
• HyperLogLog Algorithm: Estimates the count using probabilistic data
structures with minimal memory.
Example:
• Count the number of unique IP addresses in network traffic logs.
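HyperLogLog's precursor, the Flajolet–Martin sketch, shows the core idea compactly: the more distinct elements seen, the longer the longest run of trailing zeros among their hash values. This single-estimator version has high variance; real systems average many independent estimators.

```python
import hashlib

def trailing_zeros(n, bits=32):
    """Number of trailing zero bits in n (returns bits if n == 0)."""
    if n == 0:
        return bits
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream):
    """Flajolet-Martin sketch: distinct count is estimated as 2^R, where R is
    the maximum number of trailing zeros over all hashed elements."""
    r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        r = max(r, trailing_zeros(h))
    return 2 ** r

ips = [f"10.0.0.{i % 50}" for i in range(1000)]   # 50 unique IP addresses
print(fm_estimate(ips))   # a rough order-of-magnitude estimate of 50
```

Only a single integer R needs to be stored regardless of stream length, which is why sketches of this family use so little memory.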
7. Estimating Moments
Definition: Frequency moments are statistical measures that summarize the distribution of element frequencies in a data stream, capturing properties such as spread and skewness of the distribution.
Types of Moments (for element frequencies mi):
• Zeroth Moment (F0): The number of distinct elements in the stream.
• First Moment (F1): The sum of all frequencies, i.e., the total length of the stream.
• Second Moment (F2): The sum of squared frequencies, also called the surprise number; it measures how uneven the frequency distribution is.
Techniques:
• Alon-Matias-Szegedy (AMS) Algorithm: Efficiently estimates moments
using random sampling.
Example:
• Estimate the variance in transaction amounts in a financial data stream.
8. Counting Ones in a Window
Definition: Counting ones in a window involves tracking the number of ones (or
specific events) in a fixed-size window of the stream.
Techniques:
• Sliding Window: Maintains a fixed-size window and updates counts as new data arrives.
• DGIM Algorithm: Approximates the count of ones using exponentially growing buckets, needing only O(log² N) space for a window of size N.
• Exponential Decay: Older data has less influence on the count over time.
Example:
• Count the number of "likes" in the last 5 minutes on a live video stream.
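An exact sliding-window count can be maintained in constant time per element with a deque, as sketched below; the bit sequence stands in for a stream of like/no-like events.

```python
from collections import deque

class SlidingWindowCounter:
    """Exact count of ones (events) within the last `size` stream elements."""
    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.ones = 0

    def add(self, bit):
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]   # oldest bit is about to be evicted
        self.window.append(bit)
        self.ones += bit

    def count(self):
        return self.ones

counter = SlidingWindowCounter(size=5)
for bit in [1, 0, 1, 1, 0, 1, 1]:         # 1 = a "like" event occurred
    counter.add(bit)
print(counter.count())                     # 4 ones among the last five bits
```

This exact approach stores the whole window; when N is too large for that, the DGIM algorithm trades a small approximation error for O(log² N) space.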
9. Decaying Window
Definition: A decaying window assigns weights to elements in a stream based on
their age, giving more importance to recent data.
Advantages:
• Keeps analytics relevant to recent trends.
• Efficiently handles infinite streams by discarding older, less important data.
Techniques:
• Exponential Decay Function: Applies a decay factor e^(−λt), where t is the age of the data and λ controls how quickly older data loses weight.
Example:
• Monitor CPU usage, giving higher priority to recent data while reducing older
data’s impact.
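The CPU-usage example above can be sketched as a decay-weighted average. The readings and the decay rate λ = 0.1 are illustrative values, not measurements.

```python
import math

def decayed_average(samples, lam=0.1):
    """Average of (age, value) pairs weighted by e^(-lam * age),
    so recent values dominate and old ones fade smoothly."""
    num = sum(v * math.exp(-lam * t) for t, v in samples)
    den = sum(math.exp(-lam * t) for t, _ in samples)
    return num / den

# CPU usage readings as (age in seconds, percent used); values are illustrative.
readings = [(0, 90.0), (10, 50.0), (60, 10.0)]
print(decayed_average(readings))   # pulled strongly toward the newest reading (90)
```

Note how the result sits far above the plain mean of 50: the 60-second-old reading contributes almost nothing, which is exactly the behavior a decaying window is meant to produce.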
10. Real-time Analytics Platform (RTAP) Applications
Definition: RTAP refers to platforms designed for processing and analyzing real-
time data streams.
Applications:
• Sentiment Analysis: Analyze social media streams to determine public opinion in real-time.
• Stock Market Predictions: Process stock trade data to predict price movements and trends.
• Fraud Detection: Monitor transactions to detect fraudulent activities instantly.
• IoT Monitoring: Analyze sensor data from smart devices for immediate action.
Key Features:
• Low latency for real-time decision-making.
• Scalable to handle high-velocity data streams.
• Integration with distributed systems (e.g., Apache Kafka, Spark Streaming).
Example:
• A bank uses RTAP to monitor millions of credit card transactions to detect
and block fraud in real-time.
Case Studies
1. Real-Time Sentiment Analysis
Objective: Analyze real-time social media streams to understand and quantify
public sentiment.
Applications:
• Monitoring political sentiment during elections.
• Tracking customer feedback during product launches.
• Real-time reaction analysis to breaking news or crises.
Steps:
1. Data Collection:
• Use APIs such as Twitter API, Reddit API, or Facebook Graph API to
stream live data.
• Filter data using keywords, hashtags, geolocation, or user metadata.
• Example: Collect tweets containing hashtags like #Election2024 or #NewProductLaunch.
2. Data Preprocessing:
• Cleaning: Remove stopwords, special characters, emojis, URLs, and unnecessary whitespace.
• Tokenization: Break text into individual words or phrases for analysis.
• Normalization: Convert text to lowercase, and apply stemming or lemmatization to unify word forms.
• Example: Convert "Running" and "Runs" to "Run."
3. Sentiment Analysis:
• Apply sentiment scoring algorithms to classify text as positive, negative,
or neutral.
• Popular Tools:
– VADER (Valence Aware Dictionary and sEntiment Reasoner): Ideal for analyzing short, informal text like tweets.
– TextBlob: Simple library for sentiment scoring and polarity detection.
– Deep Learning Models: Pre-trained models like BERT or GPT-3 for high accuracy in complex sentiment detection.
• Example: A tweet like "This product is amazing!" is classified as positive sentiment.
4. Real-Time Processing:
• Use stream processing platforms such as Apache Kafka, Apache Flink, or
Spark Streaming for low-latency data processing.
• Continuously calculate sentiment scores and update visualizations or dashboards.
5. Visualization and Insights:
• Use visualization tools like Tableau, Power BI, or custom dashboards to
display trends.
• Example: Real-time sentiment heatmaps during a political debate or product launch.
2. Stock Market Predictions
Objective: Analyze and predict stock price trends in real-time using market data
and news feeds.
Applications:
• Predicting stock price fluctuations based on breaking news.
• Identifying trading opportunities using historical and real-time data.
• Risk management through early detection of market volatility.
Steps:
1. Data Collection:
• Gather real-time stock data from APIs such as Yahoo Finance, Alpha
Vantage, or Bloomberg.
• Collect relevant financial news or tweets using web scraping tools or news
APIs.
• Example: Monitor the stock prices of companies like Apple and Tesla
while analyzing related news articles.
2. Data Preprocessing:
• Cleaning: Remove irrelevant data like stopwords, duplicate records, or
incomplete stock logs.
• Feature Engineering: Generate features such as moving averages, trading volume, and sentiment scores from news articles.
• Normalization: Scale stock prices and volumes to improve the performance of prediction models.
3. Modeling and Prediction:
• Use machine learning or deep learning models to predict price movements.
• Popular Models:
– Linear Regression: For modeling price trends over time.
– ARIMA (AutoRegressive Integrated Moving Average): For
time-series forecasting.
– LSTM (Long Short-Term Memory): Deep learning model for
sequential data like stock prices.
• Example: Predict whether a stock’s price will rise or fall based on the past
24 hours’ trends.
4. Real-Time Processing:
• Stream data using platforms like Apache Kafka or Spark Streaming.
• Continuously update predictions based on new stock prices or news feeds.
5. Visualization and Insights:
• Display stock trends, predicted price movements, and trading signals on
a dashboard.
• Example: A real-time graph showing Tesla’s price prediction for the next
15 minutes.
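The moving-average feature mentioned in the feature-engineering step can be computed incrementally as prices stream in, as sketched below; the prices and window size are illustrative.

```python
from collections import deque

def moving_averages(prices, window=3):
    """Simple moving average computed incrementally, as prices stream in."""
    buf = deque(maxlen=window)
    out = []
    for price in prices:
        buf.append(price)
        out.append(sum(buf) / len(buf))   # average of the last `window` prices
    return out

prices = [100.0, 102.0, 101.0, 105.0, 107.0]   # illustrative closing prices
print(moving_averages(prices))
# [100.0, 101.0, 101.0, 102.67, 104.33]  (last two values rounded here)
```

Features like this are then fed, together with sentiment scores and volume, into models such as ARIMA or LSTM for the prediction step.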