Data Analytics Chapter 3

The document outlines the syllabus for a Data Analytics course, focusing on data stream mining, including concepts, models, and applications. It covers various techniques for processing data streams in real-time, such as sampling, filtering, and counting distinct elements, along with their challenges and advantages. Additionally, it presents case studies on real-time sentiment analysis and stock market predictions, detailing the steps involved in data collection, preprocessing, modeling, and visualization.



ITECH WORLD AKTU


Subject Name: Data Analytics
Subject Code: BCS052

UNIT 3: Mining Data Streams


Syllabus
1. Introduction to streams concepts

2. Stream data model and architecture

3. Stream computing

4. Sampling data in a stream

5. Filtering streams

6. Counting distinct elements in a stream

7. Estimating moments

8. Counting ones in a window

9. Decaying window

10. Real-time Analytics Platform (RTAP) applications

11. Case studies:

• Real-time sentiment analysis


• Stock market predictions

1. Introduction to Streams Concepts

Definition: A data stream is a continuous and real-time flow of data elements made
available sequentially over time. Unlike traditional static datasets, data streams are
dynamic and require real-time processing and analysis to extract actionable insights.
Key Characteristics:

• Continuous Flow: Data streams are generated and processed continuously, often without a defined start or end point.

• High Volume: Streams can produce a large amount of data per second, requiring scalable systems to handle the load.

• Real-Time Processing: Due to their continuous nature, streams demand real-time or near real-time analysis.

• Transient Data: Data in streams may not be stored permanently and could be processed in memory or with sliding windows.

• Heterogeneity: Data streams can come from diverse sources and in varying formats (structured, semi-structured, or unstructured).

Applications of Data Streams:

• Social Media Analytics: Monitoring platforms like Twitter or Instagram for trending topics and sentiment analysis.

• IoT (Internet of Things): Devices like smart sensors in a factory transmitting data about temperature, pressure, or performance in real-time.

• E-commerce: Streaming customer interactions on websites to offer dynamic recommendations.

• Finance: Real-time stock price analysis for investment decisions.

• Transportation: Monitoring traffic flows using data from GPS devices.

Example 1: Social Media Feeds

• Social media platforms like Twitter continuously generate streams of tweets. These streams can be processed in real-time to identify trending hashtags, analyze user sentiments, or detect breaking news.

Example 2: Sensor Data from IoT Devices

• Consider a smart home system where sensors monitor temperature, humidity, and energy usage. These sensors send continuous data streams to a central server, enabling real-time decisions like adjusting the thermostat or sending alerts for anomalies.

Challenges in Handling Data Streams:

• Scalability: Systems need to scale dynamically to handle spikes in data volume.

• Latency: Minimizing the time between data arrival and actionable insight is crucial for applications like fraud detection.

• Data Quality: Ensuring the accuracy and reliability of streaming data, which might have noise or incomplete values.

2. Stream Data Model and Architecture



Architecture of Stream Data Processing (Based on Diagram):

1. Data Streams:

• Continuous flow of data from various sources, such as sensors, social media, or log files.

2. Stream Manager:

• Manages incoming data streams and forwards them to the processing components.

3. System Catalog:

• Stores metadata about the system, such as stream schemas and resources used for processing.

4. Scheduler:

• Allocates tasks and resources efficiently to process the incoming data streams.

5. Router:

• Directs data streams to appropriate processing units or queues for further operations.

6. Queue Manager:

• Manages buffers and organizes incoming data to ensure orderly processing.

7. Query Processor:

• Executes queries on the stream data and provides results in real-time.

8. Query Optimizer:

• Optimizes query execution plans for efficient processing of large volumes of streaming data.

9. QoS Monitoring (Quality of Service):

• Ensures data processing meets predefined performance metrics, such as latency and throughput.

10. Storage Manager:

• Handles temporary or permanent storage of processed data, typically in databases or file systems.

11. Secondary Storage:

• Used for long-term storage of data streams and historical data.



3. Stream Computing

Definition: Stream computing refers to the continuous and real-time processing of data streams as they arrive, without storing them for batch processing.
Key Features of Stream Computing:

• Continuous Processing: Data is processed on the fly as it streams in.

• Low Latency: Enables near real-time decision-making by minimizing processing delays.

• Scalable: Handles large and variable volumes of data streams efficiently.

• Fault Tolerance: Ensures the system can recover from failures and continue
processing seamlessly.

Advantages of Stream Computing:

• Real-Time Insights: Enables immediate response to events, such as fraud detection or anomaly identification.

• Dynamic Scalability: Adapts to fluctuating data volumes in applications like IoT and e-commerce.

• Improved Resource Utilization: Processes only the required data without storing entire datasets.

• Continuous Analytics: Provides ongoing analysis rather than waiting for batch processes to complete.

• Supports Complex Workflows: Handles advanced operations like filtering, aggregations, and joins in real-time.

Applications of Stream Computing:

• Fraud Detection: Monitors transactions in real-time for unusual patterns.

• IoT Applications: Processes sensor data from smart devices continuously.

• Healthcare: Analyzes patient vitals in real-time for critical care alerts.

• E-commerce: Tracks customer behavior to offer real-time recommendations.

• Social Media Analytics: Monitors and analyzes trends and user sentiments.

Example: Real-Time Traffic Monitoring

• GPS devices in vehicles stream location data to a central system.

• Stream computing processes this data to identify traffic congestion and suggest
alternative routes in real-time.

Example: Fraud detection in banking: Transaction streams are analyzed to detect unusual patterns in real-time.
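The fraud-detection example can be sketched with Python generators, which mirror the "process each element as it arrives" style of stream computing. The feed and threshold below are illustrative stand-ins for a real transaction source such as a Kafka topic:

```python
def transaction_stream():
    """Stand-in for a live feed; a real system would consume from
    Kafka, a socket, or a message queue instead of a fixed list."""
    for amount in [20.0, 35.5, 18.0, 9400.0, 27.0]:
        yield amount

def flag_unusual(stream, threshold=1000.0):
    """Process each transaction the moment it arrives, emitting alerts
    immediately instead of waiting for a batch job to run."""
    for amount in stream:
        if amount > threshold:
            yield f"ALERT: unusual transaction of {amount:.2f}"

alerts = list(flag_unusual(transaction_stream()))
```

Because generators are lazy, each transaction is inspected exactly once and nothing is buffered, which is the essence of the low-latency, low-memory style described above.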

4. Sampling Data in a Stream

Definition: Sampling in data streams involves selecting a representative subset of the stream for analysis, especially when processing the entire stream is infeasible due to high volume or resource constraints.
Importance of Sampling:

• Reduces computational load by working on a smaller subset of data.

• Helps in estimating patterns or trends without analyzing the full stream.

• Useful in scenarios where storage or processing capacity is limited.

Techniques for Sampling:

• Random Sampling:

– Data points are chosen randomly from the stream.
– Simple to implement but may not ensure uniform representation.

• Reservoir Sampling:

– Maintains a fixed-size sample from the stream by dynamically replacing elements as new data arrives.
– Ensures uniform probability for each data point in the stream.

• Systematic Sampling:

– Picks every k-th element from the stream after selecting a random starting point.
– Useful when the stream has a repetitive structure or pattern.

• Stratified Sampling:

– Divides the stream into strata (subgroups) and samples proportionally from each stratum.
– Ensures representation of all subgroups in the data.

• Priority Sampling:

– Assigns a priority to each data point based on importance or weight and selects data with higher priorities.

Advantages of Sampling:

• Reduces memory and processing requirements.

• Allows real-time analytics on high-velocity streams.

• Enables quicker insights by focusing on relevant portions of the data.



Applications of Sampling:

• Social Media Analysis: Sampling a subset of tweets to estimate trending topics or sentiments.

• Network Traffic Monitoring: Analyzing a fraction of packets to detect anomalies or patterns.

• Financial Data Streams: Sampling stock price streams to estimate market trends without analyzing all data.

Example 1: Random Sampling in Sentiment Analysis

• From a stream of 1 million tweets, randomly select 10% to estimate the sentiment trends for a product launch.

Example 2: Reservoir Sampling in IoT

• A smart thermostat collects temperature readings continuously. Reservoir sampling is used to maintain a fixed-size sample for quick diagnostics.
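Reservoir sampling (often called Algorithm R) fits in a few lines of Python. The sketch below is illustrative; the function name and sample size are arbitrary choices:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream whose
    total length is unknown in advance."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # pick a slot in [0, i]
            if j < k:
                reservoir[j] = item     # replace with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(100000), 10)
```

Each element ends up in the sample with probability exactly k/n, yet only k items are ever held in memory, which is why this technique suits the thermostat example above.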

5. Filtering Streams

Definition: Filtering streams involves selecting or discarding elements from a data stream based on specific criteria or conditions.
Techniques:

• Content-based Filtering: Filters data based on attributes or content (e.g., filter tweets with specific hashtags).

• Time-based Filtering: Filters data within a specific time range (e.g., log
entries in the last 24 hours).

• Probabilistic Filtering: Uses approximate methods like Bloom filters to test membership efficiently.

Example:

• Filter sensor data to only include temperature readings above 30°C.
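A minimal Bloom filter, the probabilistic structure mentioned above, might look like the sketch below. The bit-array size, number of hashes, and hashing scheme are illustrative choices; a production system would use a tuned library implementation:

```python
import hashlib

class BloomFilter:
    """Approximate set membership: never a false negative, occasionally
    a false positive, using a fixed-size bit array."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))
```

This is how a stream processor can answer "have we seen this element before?" without storing the elements themselves.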

6. Counting Distinct Elements in a Stream

Definition: Counting distinct elements identifies the number of unique items in a data stream.
Challenges:

• High data volume makes storing all elements impractical.

• Efficient algorithms are needed to approximate the count.



Techniques:

• Hashing: Hash elements to reduce memory usage.

• HyperLogLog Algorithm: Estimates the count using probabilistic data structures with minimal memory.

Example:

• Count the number of unique IP addresses in network traffic logs.
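The idea behind probabilistic distinct counting can be illustrated with a minimal Flajolet-Martin sketch, the precursor to HyperLogLog. This single-register version is only a rough estimator; real deployments use HyperLogLog libraries that average many registers for accuracy:

```python
import hashlib

def fm_estimate(stream):
    """Flajolet-Martin sketch: estimate the distinct count as 2^r, where
    r is the maximum number of trailing zero bits observed in any
    element's hash. Duplicates hash identically, so they cannot
    inflate the estimate."""
    max_r = 0
    seen_any = False
    for item in stream:
        seen_any = True
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        r = 0
        while h & 1 == 0 and r < 256:   # count trailing zero bits
            h >>= 1
            r += 1
        max_r = max(max_r, r)
    return 2 ** max_r if seen_any else 0
```

The memory footprint is a single integer regardless of stream size, which is exactly the property needed for counting unique IP addresses in high-volume traffic logs.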

7. Estimating Moments

Definition: Moments are statistical measures used to understand the properties of a data distribution, such as variance or skewness.
Types of Moments:

• First Moment: The total number of elements in the stream (the sum of all element frequencies).

• Second Moment: The sum of the squares of the element frequencies, used for measuring the skew or variance of the distribution.

Techniques:

• Alon-Matias-Szegedy (AMS) Algorithm: Efficiently estimates moments using random sampling.

Example:

• Estimate the variance in transaction amounts in a financial data stream.
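A simplified AMS second-moment estimator might look like the sketch below. Each random variable picks a position in the stream, counts later occurrences of the element at that position, and contributes n·(2·count − 1); averaging these gives an unbiased estimate of F2. For clarity this sketch materializes the stream to pick positions; the true one-pass algorithm chooses positions on the fly:

```python
import random

def ams_second_moment(stream, num_vars=50):
    """Estimate F2 = sum of squared element frequencies (AMS sketch)."""
    data = list(stream)                  # materialized here for clarity
    n = len(data)
    positions = [random.randrange(n) for _ in range(num_vars)]
    estimates = []
    for pos in positions:
        target = data[pos]
        # Count occurrences of the chosen element from pos onward.
        count = sum(1 for x in data[pos:] if x == target)
        estimates.append(n * (2 * count - 1))
    return sum(estimates) / num_vars
```

When every element is distinct, F2 equals the stream length, and each variable reports exactly that value, so the estimate is exact; skewed streams introduce variance that the averaging reduces.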

8. Counting Ones in a Window

Definition: Counting ones in a window involves tracking the number of ones (or
specific events) in a fixed-size window of the stream.
Techniques:

• Sliding Window: Maintains a fixed-size window and updates counts as new data arrives.

• Exponential Decay: Older data has less influence on the count over time.

Example:

• Count the number of "likes" in the last 5 minutes on a live video stream.
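When the window fits in memory, an exact sliding-window counter is straightforward (the DGIM algorithm is the classic approximation for when it does not). A minimal sketch, with illustrative names:

```python
from collections import deque

class SlidingWindowCounter:
    """Exact count of ones in the last window_size bits of a stream."""
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.ones = 0

    def add(self, bit):
        # If the window is full, the oldest bit is about to fall out.
        if len(self.window) == self.window.maxlen and self.window[0] == 1:
            self.ones -= 1
        self.window.append(bit)          # deque evicts the oldest bit itself
        if bit == 1:
            self.ones += 1

    def count_ones(self):
        return self.ones
```

Keeping a running total means each query is O(1) rather than rescanning the window.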

9. Decaying Window

Definition: A decaying window assigns weights to elements in a stream based on their age, giving more importance to recent data.
Advantages:

• Keeps analytics relevant to recent trends.

• Efficiently handles infinite streams by discarding older, less important data.

Techniques:

• Exponential Decay Function: Applies a decay factor e^(−λt), where t is the age of the data.

Example:

• Monitor CPU usage, giving higher priority to recent data while reducing older
data’s impact.
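The decay factor e^(−λt) can be applied incrementally: multiply the running total by e^(−λ·Δt) before folding in each new observation, so old contributions fade automatically. A minimal sketch with illustrative names:

```python
import math

class DecayingCounter:
    """Exponentially decayed running sum: recent updates dominate."""
    def __init__(self, decay_rate):
        self.decay_rate = decay_rate     # the lambda in e^(-lambda * t)
        self.value = 0.0
        self.last_time = None

    def update(self, amount, timestamp):
        if self.last_time is not None:
            dt = timestamp - self.last_time
            self.value *= math.exp(-self.decay_rate * dt)  # age old data
        self.value += amount
        self.last_time = timestamp
        return self.value
```

With decay_rate = 0 this reduces to an ordinary sum; larger values make the counter forget faster, which is the knob used when monitoring something like CPU usage.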

10. Real-time Analytics Platform (RTAP) Applications

Definition: RTAP refers to platforms designed for processing and analyzing real-time data streams.
Applications:

• Sentiment Analysis: Analyze social media streams to determine public opinion in real-time.

• Stock Market Predictions: Process stock trade data to predict price move-
ments and trends.

• Fraud Detection: Monitor transactions to detect fraudulent activities instantly.

• IoT Monitoring: Analyze sensor data from smart devices for immediate action.

Key Features:

• Low latency for real-time decision-making.

• Scalable to handle high-velocity data streams.

• Integration with distributed systems (e.g., Apache Kafka, Spark Streaming).

Example:

• A bank uses RTAP to monitor millions of credit card transactions to detect and block fraud in real-time.

Case Studies
1. Real-Time Sentiment Analysis

Objective: Analyze real-time social media streams to understand and quantify public sentiment.
Applications:

• Monitoring political sentiment during elections.

• Tracking customer feedback during product launches.

• Real-time reaction analysis to breaking news or crises.

Steps:

1. Data Collection:

• Use APIs such as Twitter API, Reddit API, or Facebook Graph API to
stream live data.
• Filter data using keywords, hashtags, geolocation, or user metadata.
• Example: Collect tweets containing hashtags like #Election2024 or #NewProductLaunch.

2. Data Preprocessing:

• Cleaning: Remove stopwords, special characters, emojis, URLs, and unnecessary whitespace.
• Tokenization: Break text into individual words or phrases for analysis.
• Normalization: Convert text to lowercase, and apply stemming or lemmatization to unify word forms.
• Example: Convert "Running" and "Runs" to "Run."

3. Sentiment Analysis:

• Apply sentiment scoring algorithms to classify text as positive, negative, or neutral.
• Popular Tools:
– VADER (Valence Aware Dictionary and sEntiment Reasoner): Ideal for analyzing short, informal text like tweets.
– TextBlob: Simple library for sentiment scoring and polarity detection.
– Deep Learning Models: Pre-trained models like BERT or GPT-3
for high accuracy in complex sentiment detection.
• Example: A tweet like "This product is amazing!" is classified as positive sentiment.

4. Real-Time Processing:

• Use stream processing platforms such as Apache Kafka, Apache Flink, or Spark Streaming for low-latency data processing.
• Continuously calculate sentiment scores and update visualizations or dashboards.

5. Visualization and Insights:

• Use visualization tools like Tableau, Power BI, or custom dashboards to display trends.
• Example: Real-time sentiment heatmaps during a political debate or product launch.
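The scoring step of the pipeline above can be illustrated with a toy lexicon-based classifier. The word lists here are hand-made stand-ins; a real system would use VADER, TextBlob, or a trained model instead:

```python
import re

# Tiny illustrative lexicons; real analyzers ship with thousands of
# weighted entries and handle negation, intensifiers, and emojis.
POSITIVE = {"amazing", "great", "love", "good", "excellent"}
NEGATIVE = {"terrible", "bad", "hate", "awful", "poor"}

def sentiment(text):
    """Classify text as positive, negative, or neutral by counting
    lexicon hits after lowercasing and stripping punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

In a streaming deployment, this function would run on each cleaned tweet as it arrives, with the running score distribution feeding the dashboard.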

2. Stock Market Predictions

Objective: Analyze and predict stock price trends in real-time using market data
and news feeds.
Applications:

• Predicting stock price fluctuations based on breaking news.

• Identifying trading opportunities using historical and real-time data.

• Risk management through early detection of market volatility.

Steps:

1. Data Collection:

• Gather real-time stock data from APIs such as Yahoo Finance, Alpha
Vantage, or Bloomberg.
• Collect relevant financial news or tweets using web scraping tools or news
APIs.
• Example: Monitor the stock prices of companies like Apple and Tesla
while analyzing related news articles.

2. Data Preprocessing:

• Cleaning: Remove irrelevant data like stopwords, duplicate records, or incomplete stock logs.
• Feature Engineering: Generate features such as moving averages, trading volume, and sentiment scores from news articles.
• Normalization: Scale stock prices and volumes to improve the performance of prediction models.

3. Modeling and Prediction:



• Use machine learning or deep learning models to predict price movements.
• Popular Models:
– Linear Regression: For modeling price trends over time.
– ARIMA (AutoRegressive Integrated Moving Average): For
time-series forecasting.
– LSTM (Long Short-Term Memory): Deep learning model for
sequential data like stock prices.
• Example: Predict whether a stock’s price will rise or fall based on the past
24 hours’ trends.

4. Real-Time Processing:

• Stream data using platforms like Apache Kafka or Spark Streaming.
• Continuously update predictions based on new stock prices or news feeds.

5. Visualization and Insights:

• Display stock trends, predicted price movements, and trading signals on a dashboard.
• Example: A real-time graph showing Tesla’s price prediction for the next
15 minutes.
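The feature-engineering and prediction steps can be illustrated with a toy moving-average crossover signal. This is a didactic sketch with illustrative window sizes, not a trading strategy; real systems would fit ARIMA or LSTM models as listed above:

```python
def moving_average(prices, window):
    """Simple moving average of the last `window` prices, or None if
    there is not yet enough history."""
    if len(prices) < window:
        return None
    return sum(prices[-window:]) / window

def trade_signal(prices, short=3, long=5):
    """Toy crossover rule: 'buy' when the short-term average sits above
    the long-term average, 'sell' when below, else 'hold'."""
    s = moving_average(prices, short)
    l = moving_average(prices, long)
    if s is None or l is None:
        return "hold"
    if s > l:
        return "buy"
    if s < l:
        return "sell"
    return "hold"
```

In a streaming setup, each new tick would be appended to the price history and the signal recomputed, with the result pushed to the dashboard described in step 5.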
